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Abstract 

Recent  developments  in  scientific  computing  have  prompted  the  need  for  supercom¬ 
puter  graphics  research.  These  developments  include  the  requirement  to  visualize  large 
amounts  of  data  which  are  processed  by  the  supercomputer,  preferably  by  real-time  image 
generation.  Unfortunately,  most  currently  used  graphics  algorithms  are  not  optimized  for 
vector  computers.  This  thesis  involves  the  design  and  implementation  of  a  hidden-surface 
removal  algorithm  for  the  Cray-2  vector  supercomputer,  with  the  goal  of  real-time  image 
display. 

A  z-buffer  hidden-surface  algorithm,  written  in  Pascal,  was  vectorized  and  imple¬ 
mented  on  the  Cray-2.  Special  attention  was  directed  toward  the  methodology  of  algorithm 
and  data  structure  design  to  exploit  the  Cray-2  architecture.  Timing  studies  comparing 
the  vector  version  to  the  equivalent  scalar  version  showed  that  while  the  Pascal  vector  code 
produced  localized  speedups,  the  vector  code  was  less  efficient  than  the  scalar  code.  When 
critical  portions  of  the  code  were  translated  from  Pascal  to  FORTRAN,  significant  speedup 
was  achieved,  indicating  that  the  FORTRAN  compiler  generates  more  efficient  object  code 
than  the  Pascal  compiler.  However,  real-time  performance  was  not  achieved.  Based  on 
the  knowledge  gained  from  vectorizing  the  z-buffer  algorithm,  vectorization  heuristics  were 
discovered  and  a  new  algorithm  which  should  provide  further  speedup  is  presented. 

Experience  from  the  thesis  research  indicates  that  when  attempting  to  achieve  real¬ 
time  performance,  consideration  must  be  given  to  compiler  efficiency,  supercomputer  archi¬ 
tecture,  type  of  data  displayed,  and  display  algorithm.  Relying  only  upon  good  algorithm 
design  may  not  produce  adequate  performance,  and  if  any  of  these  four  areas  is  not  prop¬ 
erly  addressed,  performance  suffers.  Another  important  observation  is  that  vectorization 
is  difficult  and  can  obscure  the  clear  function  of  software,  decreasing  software  readability 
and  perhaps  maintzunability.  Thus,  transforming  scalar  software  into  software  structured 
for  vector  operations  should  be  done  sparingly;  only  when  absolutely  essential  to  meet  per¬ 
formance  requirements.  However,  despite  its  difficulties,  vectorization  is  a  powerful  way  to 
increase  software  efficiency. 


A  VECTORIZED  HIDDEN-SURFACE  ALGORITHM 
IMPLEMENTED  ON  THE  CRAY-2 
SUPERCOMPUTER 
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1.1  Introduction 

Recent  developments  in  scientific  computing  have  prompted  the  need  for  supercom¬ 
puter  graphics  research.  These  developments  include  the  requirement  to  visualize  the  large 
quantities  of  scientific  data  which  are  processed  by  the  supercomputer.  Unfortunately,  the 
lack  of  computer  graphics  algorithms  specifically  developed  for  the  supercomputer  archi¬ 
tecture  hinders  this  visualization.  This  thesis  involves  the  design  and  implementation  of  a 
graphics  algorithm  developed  specifically  for  the  Cray-2  supercomputer  architecture. 

This  chapter  presents  the  background  material  which  justifies  the  need  for  the  thesis. 
It  also  examines  the  role  of  computer  graphics  workstations  and  their  relationship  to  the 
supercomputer  and  discusses  the  probable  contribution  of  the  supercomputer  in  future 
computer  graphics  work.  Furthermore,  the  assumptions  necessary  for  completion  of  the 
thesis,  as  well  as  the  scope  of  the  thesis  are  addressed.  Finally,  the  methodology  and 
equipment  needed  to  implement  the  thesis  are  discussed.  As  for  the  remainder  of  the  thesis. 
Chapter  2  reviews  the  literature  applicable  to  this  thesis  topic,  and  Chapter  3  presents  the 
graphics  algorithm  design.  The  results  are  provided  in  Chapter  4  and  Chapter  5  outlines 
the  conclusions  and  recommendations  for  future  research. 

1.2  Background 

1.2.1  The  Popularity  of  Graphics  Workstations  Computer  graphics,  once  confined 
largely  to  mainframe  computers,  is  now  supported  by  the  small  graphics  workstation, 
primarily  because  modern  integrated  circuit  technology  can  now  support  computation¬ 
ally  intensive  computer  graphics  algorithms  by  embedding  the  graphics  algorithms  in  the 


hardware.  Such  specialized  hardware  supports  matrix  transformations,  hidden-surface  re¬ 
moval,  the  display  of  polygons  and  bicubic  surfaces,  and  a  variety  of  other  graphics  display 
functions.  Processors,  memory,  and  the  bus  structure  are  designed  and  organized  into  an 
architecture  which  provides  for  efficient  display  of  image  data.  These  special  purpose  cir¬ 
cuits  are  combined  and  packaged  into  computer  graphics  workstations,  revolutionizing  the 
image  rendering  process.  With  advances  in  Very  Large  Scale  Integrated  Circuitry  (VLSI), 
graphics  workstations  can  be  expected  to  become  more  popular  and  less  costly,  while  pro¬ 
viding  more  capability.  As  a  result  of  this  technology,  many  computer  scientists  believe 
that  the  graphics  workstation  will  usurp  the  role  of  conventional  mainframes  for  process¬ 
ing  and  displaying  graphics  data  (Frenkel,  1988:114).  However,  the  memory  and  processor 
power  required  to  handle  large  quantities  of  data  typically  generated  by  supercomputers 
can  quickly  overwhelm  these  graphics  workstations. 

1.2.2  The  heed  to  Visualize  Data  Produced  by  Supercomputers  A  supercomputer 
is  currently  defined  as  a  computer  capable  of  100  Million  or  more  Floating  Point  Oper¬ 
ations  per  Second  (FLOPS).  However,  as  technology  produces  more  capable  computers, 
the  definition  of  the  supercomputer  will  change,  but  generally,  a  supercomputer  provides 
a  magnitude  of  computational  power  above  that  of  nearly  all  other  computers.  Typically, 
supercomputers  support  vector  operations  and  have  several  processors  and  specialized 
hardware  such  as  multiple  functional  units  and  pipelines.  Computational  fluid  flow  anal¬ 
yses,  weather  simulations,  image  processing,  nuclear  effects  simulations,  and  other  time 
consuming  applications  are  routinely  performed  on  supercomputers  and  require  memory 
and  computational  power  that  presently  only  supercomputers  can  provide  (Staudhammer, 
1987:25)  and  (Frenkel,  1988:119).  Furthermore,  these  applications  generate  mountains  of 
data  much  of  which  is  provided  as  rows  and  columns  of  numbers,  a  format  which  is  not  ef¬ 
ficiently  comprehended  by  humans.  Additionally,  so  much  data  is  produced  that  it  cannot 
be  used  (McCormick  and  others,  1987:vii,B-15). 

Due  to  the  magnitude  of  data  being  generated  by  supercomputers,  an  initiative  in 
computing,  termed  “Visualization  in  Scientific  Computing,”  is  receiving  attention  from 
scientists  in  a  variety  of  disciplines.  Briefly  stated,  the  goal  of  visualization  is  “to  lever- 
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age  existing  scientific  methods  by  providing  new  scientific  insight  through  visual  methods 
(McCormick  and  others,  1987:3).”  The  initiative  is  an  attempt  to  transform  numbers  and 
symbols  into  images.  Scientists  can  visually  observe  the  results  of  simulations  and  compu¬ 
tations  by  transforming  collections  of  numerical  data  from  a  multitude  of  .  ources  to  visual 
data.  This  transformation  into  a  visual  medium  recognizes  that  humans  are  capable  of 
understanding  and  interpreting  a  vast  amount  of  data  when  the  data  is  presented  visually 
in  three  dimensions.  Those  involved  in  the  visualization  effort  believe  that  “the  ability 
of  scientists  to  visualize  complex  computations  and  simulations  is  absolutely  essential  to 
ensure  the  integrity  of  analyses,  to  provoke  insight  and  to  communicate  those  insights  with 
others  (McCormick  and  others,  1987:7).”  Unfortunately,  this  data  is  so  voluminous  that 
using  computer  graphics  workstations  to  reduce  it  to  a  visual  form  is  costly  and  laborious 
and  may  be  more  quickly  performed  by  supercomputers.  Additionally,  the  volume  of  su¬ 
percomputer  data  exceeds  what  can  be  transmitted  to  the  workstation  in  any  reasonable 
time  using  existing  conventional  networks  (10-50  megabits/second). 

1.2.3  The  Future  of  Computer  Graphics  Debate  on  the  future  of  computer  graphics 
continues.  While  the  need  to  visually  display  complex  data  generated  by  supercomputers 
is  acknowledged,  the  precise  mechanism  for  this  display  is  unknown.  The  view  of  those 
who  advocate  using  workstations  to  render  and  display  the  data  is  given  by  Bruce  H.  Mc¬ 
Cormick,  one  of  the  editors  of  a  July  1987  report,  “Visualization  in  Scientific  Computing:” 

The  visualization  problem  is  not  peculiar  to  supercomputers  in  any  sense.  In 
fact,  I  don’t  expect  the  bulk  of  (the  visualization  initiative]  to  be  tied  to  them. 
Supercomputers  are  in  many  ways  dinosaurs,  a  dying  breed.  Their  costs  indi¬ 
cate  that  that  market  will  not  be  very  significant  within  the  computer  market. 

The  workstation  market  is  much  larger,  and  eventually  number  crunching  and 
modeling  will  be  delegated  between  supercomputers  and  workstations.  Parallel 
processing  will  find  its  way  into  minisupercomputers,  and  as  more  and  more 
firepower  is  needed  they  will  also  serve  as  local  graphics  and  image  processing 
machines  (Frenkel,  1988:113). 

Further  elaborating  the  workstation  position,  Donald  P.  Greenberg,  Professor  of 
Computer  Graphics  at  Cornell  University,  predicted  that:  “the  future  of  visualization  is 
going  to  depend  on  high-powered  workstations,  which  are  connected  over  high  bandwidths 
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to  supercomputers  (Frenkel,  1988:114).”  While  most  generally  agree  that  the  supercom¬ 
puter  is  necessary  for  preliminary  data  collection  and  reduction,  some  believe  it  is  too 
expensive  to  use  supercomputers  to  interactively  render  the  data.  However,  Nelson  Max,  a 
computer  scientist  at  Lawrence  Livermore  National  Laboratory  disagrees:  “as  an  algorithm 
developer,  I  cannot  use  a  workstation  that  has  specific  rendering  algorithms  already  built 
into  it.  I  want  to  create  new  algorithms,  and  to  do  this,  I  need  a  general-purpose  computer 
rather  than  specialized  hardware  to  speed  up  old  algorithms  (Frenkel,  1988:119).”  Nelson 
Max  acknowledges  that  supercomputers  are  a  costly  means  of  doing  business,  but  believes 
that  for  some  problems  there  may  be  no  alternative.  And  those  who  would  delegate  most 
computer  graphics  processing  to  workstations  admit  that  “supercomputers  are  required 
when  rendering  software  is  not  supported  by  workstation  firmware  or  when  pushing  the 
frontiers  of  the  number  of  floating  point  operations  carried  out  per  pixel  (McCormick  and 
others,  1987:B-13).”  Certainly,  the  visualization  of  complex  data  pushes  the  state  of  the 
art  in  current  image  generation  capability. 

The  entire  debate  is  similar  to  the  ongoing  squabble  between  the  advocates  of  se¬ 
quential  uniprocessor  machines  and  the  advocates  of  multiprocessor  machines.  Each  group 
predicts  the  demise  and  extinction  of  the  other.  One  is  also  reminded  of  the  current  view 
held  by  some  computer  scientists  of  the  eventual  demise  of  the  mainframe  computer,  to  be 
replaced  by  many  smaller  microcomputers. 

1.2.4  Supercomputer  Graphics  Research  Partially  due  to  the  high  costs  of  super¬ 
computing,  current  research  in  the  area  of  developing  supercomputer  graphics  algorithms 
is  sporadic,  but  is  gaining  attention.  Surprisingly,  a  major  user  of  supercomputer  graphics 
capability  is  the  motion  picture  industry,  for  which  the  supercomputer  is  essential  for  cre¬ 
ating  animation  (Frenkel,  1988:119).  However,  the  supercomputer  user  base  is  small  and 
software  for  supercomputers  is  expensive  to  develop,  so  the  variety  of  software  available 
to  supercomputer  graphics  users  has  suffered  and  is  less  than  that  available  to  graph¬ 
ics  workstation  users.  Additionally,  supercomputer  visualization  software  is  “primitive, 
nonexistent,  or  very  expensive,”  and  “existing  tools  are  not  adequate  to  meet  user  needs 
(McCormick  and  others,  1987:B-15,  ix).”  Indeed,  the  supercomputer  users  are  finding  that 
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turing  of  computer  graphics  algorithms  for  the  supercomputer  is  not  productive.  Third, 
hidden-surface  removal  algorithms  are  the  most  versatile  and  useful  graphics  algorithms  for 
vectorization.  Therefore,  the  search  for  algorithms  to  vectorize  is  limited  to  hidden-surface 
removal  algorithms. 

1.5  Scope 

The  scope  of  the  thesis,  which  follows  from  the  above  assumptions,  is  to  examine  the 
supercomputing  aspect  of  visualization.  No  attempt  to  restructure  graphics  algorithms  for 
use  oil  graphics  workstations  is  made.  This  thesis  concentrates  on  the  performance  of  the 
scalar  algorithm  verses  the  vectorized  algorithm  but  does  not  compare  the  value  of  the 
supercomputer  to  the  graphics  workstation  for  real-time  graphics  display.  Furthermore, 
only  hidden-surface  removal  algorithms  are  examined  for  vectorization,  with  the  primary 
objective  of  gaining  experience  and  insight  into  how  to  vectorize  software.  Thus,  real-time 
performance  is  not  critical  for  success,  but  is  a  goal. 

The  overall  problem  involves  the  development  of  a  hidden-surface  removal  algorithm 
which  is  vectorized  for  the  Cray-2  supercomputer.  This  enlarges  the  base  of  graphics  soft¬ 
ware  available  to  Cray-2  supercomputers  users  and  provides  a  means  for  the  visualization 
of  some  of  the  massive  amounts  of  data  generated  by  the  Cray-2  supercomputer.  The 
stopping  criteria  is  the  vectorization  of  one  hidden-surface  algorithm  for  the  rendering  of 
polygonal  data  (objects  described  solely  by  planar  polygons)  and  bicubic  surfaces  (objects 
described  by  bicubic  patches). 

1.6  Approach/Methodology 

The  approach  to  the  required  research  is: 

1.  Examine  the  Cray-2  supercomputer  architecture  to  determine  how  an  algorithm  is 
best  restructured  to  take  advantage  of  the  Cray-2  hardware.  For  the  successful 
vectorization  of  software,  the  supercomputer  hardware  must  be  understood. 

2.  Review  the  literature  for  vector  and  parallel  algorithms.  Special  attention  is  given  to 
identifying  features  of  an  algorithm  which  make  it  a  good  candidate  for  vectorization. 
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3.  Review  and  survey  existing  image  rendering  hidden-'^’irface  algorithms  for  polygonal 
surfaces.  Among  the  algorithms  that  may  be  considered  are:  Z-buffer,  A-buffer, 
Binary  Space  Partitioning,  Depth-sort,  and  Scan-line  algorithms. 

4.  Review  and  survey  existing  bicubic  surface  image  rendering  hidden-surface  algo¬ 
rithms.  Among  the  algorithms  that  may  be  considered  arc;  the  Blinn  algorithm, 
the  Whitted  algorithm,  the  Lane- Carpenter  algorithm,  and  the  adaptive  forward 
differencing  algorithm. 

5.  Choose  one  of  the  algorithms  in  the  above  two  categories  that  provides  the  greatest 
potential  for  vectorization  and  implement  it  on  the  Cray-2  supercomputer.  Special 
attention  is  directed  toward  the  methodology  of  algorithm  and  data  structure  design 
to  exploit  the  supercomputer  hardware. 

6.  Conduct  timing  studies  on  the  Cray-2  supercomputer.  The  ability  to  achieve  near 
real-time  animation  (10  frames  per  second)  of  reasonable  object  descriptions  (1000  to 
10,000  polygons)  is  targeted.  A  comparison  of  the  Cray-2  scalar  algorithm  execution 
time  to  the  vectorized  algorithm  is  performed. 

7.  Collect  heuristics  for  successful  vectorization  based  upon  experience  developing  soft¬ 
ware  for  the  Cray-2. 

1.1  Materials/Equipment 

The  Air  Force  Weapons  Laboratory  (AFWL)  at  Kirtland  Air  Force  Base,  New  Mex¬ 
ico,  provides  access  to  the  Cray-2  supercomputer  with  a  Pascal  vectorizing  compiler  for 
implementing  the  hidden-surface  removal  algorithm.  The  Cray-2  is  connected  by  a  100 
megabyte  per  second  channel  to  an  Ultra  frame  buffer  with  1280  x  1024  pixel  resolution 
display.  For  work  done  at  the  Air  Force  Institute  of  Technology,  an  Arpanet  connection  to 
the  facilities  at  AFWL  is  provided. 

1.8  Conclusion 

This  work  provides  insight  into  how  algorithms,  particularly  computer  graphics  al¬ 
gorithms,  can  be  vectorized  to  exploit  supercomputer  architectures.  The  strengths  and 


weaknesses  of  the  supercomputer  for  graphics  work  are  revealed.  Additionally,  it  reinforces 
the  usefulness  of  the  supercomputer  for  displaying  data  and  contributes  to  the  supercom¬ 
puter  software  base.  Finally,  the  supercomputer  may  provide  adequate  speedup  to  show 
that  images  can  be  rendered  in  real-time. 
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II.  Literature  Search 


2.1  Introduction 

Chapter  2  reviews  the  background  information  necessary  to  investigate  the  practi¬ 
cality  of  using  the  supercomputer  to  render  images  in  real-time  by  restructuring  computer 
graphics  algorithms  to  exploit  the  advantages  of  vector  and  parallel  computers.  Because 
computer  graphics  algorithms  are  computationally  intensive,  they  may  benefit  from  the 
speedup  achieved  by  vectorizing  the  algorithm  for  execution  on  the  Cray-2  supercomputer. 
Successful  vectorization  requires  a  working  knowledge  of  the  computer  architecture  and 
this  thesis  discusses  the  Cray-2  supercomputer  architecture  and  some  of  its  features  which 
provide  for  the  speedup  of  conventional  computer  graphics  algorithms.  The  visualization  of 
scientific  data,  and  its  relationship  to  the  display  of  real-time  computer  generated  images 
is  also  examined.  Finally,  the  algorithms  for  displaying  images  and  the  general  qualities  of 
algorithms  which  provide  the  greatest  potential  for  vectorization  are  identified. 

2.2  The  Cray-2  Supercomputer 

2.2.1  Architecture  The  Cray-2  supercomputer  is  the  successor  to  the  Cray-1  and  is 
a  Multiple  Single  Instruction  Multiple  Data  (MSIMD)  vector  computer.  Its  architecture 
is  similar  to  the  Cray-1,  but  it  has  been  improved.  Operating  at  a  4.1  ns  clock  rate,  the 
Cray-2’s  four  processors  can  provide  peak  performance  of  over  975  Million  Floating  Point 
operations  per  Second  (MFLOPS).  The  Cray-2  boasts  256  million  64-bit  words  of  main 
memory  (2  gigabytes).  It  contains  eight  32-bit  address  registers  (A0-A7),  eight  64-bit  scalar 
registers  (S0-S7),  and  eight-64  word  vector  registers  (V0-V7,  4096  bits  per  vector  register). 
Each  of  its  four  processors  contains  nine  functional  units  to  perform  address  add,  address 
multiply,  scalar  integer,  scalar  shift,  scalar  logical,  vector  integer,  vector  logical,  floating 
point  add,  and  floating  point  multiply  operations  (Cray  Research,  1987:2-8).  It  supports 
multitasking  and  multiprogramming.  The  Cray-2  supercomputer  with  four  processors  can 
support  a  maximum  of  eight  HSX  controllers,  each  providing  an  853  megabits  per  second 
connection  to  high-speed  devices.  This  large  bandwidth  is  absolutely  essential  for  real-time 
display.  Figure  2.1  shows  the  Cray-2  supercomputer. 
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The  clear  tubes  behind  the  cylindrically  shaped  computer  are  reservoirs  used  to  hold 
the  inert  fluorocarbon  liquid  when  the  Cray-2  is  drained  for  maintenance.  This  inert 
fluorocarbon  liquid  engulfs  and  cools  the  Cray-2’s  integrated  circuitry. 

2.2.2  Cray-2  Memory  Organization  An  understanding  of  the  Cray-2’s  memory  or¬ 
ganization  is  essentieJ  for  effective  vectorization.  Due  to  the  Cray’s  large  memory  capacity, 
program  execution  speed  can  be  significantly  affected  by  haphazard  memory  reference. 
Figure  2.2  depicts  the  memory  organization. 

The  Cray-2’s  256  MW  of  memory  is  dynamic  metallic  oxide  semiconductor  (MOS). 
The  memory  is  divided  into  four  quadrants  of  64M  words  each,  each  quadrant  containing  32 
2MW  banks,  and  each  bank  is  subdivided  into  two  pseudo-banks  (Cray  Research,  1987:B- 
3). 

According  to  the  Air  Force  Supercomputer  Center,  access  to  the  Cray-2’s  memory 
is  “considerably  slower  than  access  to  the  smaller  (2MW)  memory  of  the  Cray-1  (Kam- 
rath, 1988:4).”  The  Cray-2’s  four  processors  have  access  to  any  given  quadrant  every  fourth 
clock  period.  During  the  next  three  clock  periods,  a  processor  is  prevented  from  access¬ 
ing  the  quadrant  it  most  recently  accessed.  Thus,  if  a  computer  scientist  designs  a  data 
structure  in  which  elements  that  occupy  the  same  quadrant  are  referenced  in  sequence, 
the  memory  bottleneck  degrades  performance  appreciably  and  may  eliminate  or  reverse 
any  speedups  due  to  vectorization.  The  way  in  which  elements  in  memory  are  accessed  is 
referred  to  as  stride.  For  example,  a  stride  of  one  denotes  the  accessing  of  successive  words 
in  contiguous  memory,  while  a  stride  of  two  denotes  accessing  elements  in  every  other  word 
of  memory.  Table  2.1  lists  the  Cray-2’s  performance,  in  MFLOPS,  for  some  simulations 
in  which  the  stride  was  varied.  Note  the  serious  degradation  in  performance  as  the  stride 
varies  from  one  to  256. 

Avoiding  access  of  the  same  quadrant  in  succession  is  not  the  only  concern  the  user 
faces.  Further  attention  must  be  directed  to  avoiding  access  of  the  same  pseudo-bank  in 
succession.  After  a  pseudo-bank  is  referenced,  it  cannot  be  referenced  again  until  after  the 
chip  reservation  time  has  expired.  The  chip  reservation  time  is  45  clock  periods.  Thus,  if 
the  same  pseudo-bank  is  accessed  twice  in  a  row,  the  processor  must  idle  for  45  clock  periods 
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256  MW  of  Main  Memory  is  divided  into  4  Quadrants 


Quadrant  1 

Quadrant  2 

Quadrant  3 

Quadrant  4 

Each  64MW  Quadrant  contains  32  Banks 


Each  2MW  Bank  contains  2  Pseudo  Banks 


Pseudo-bank  1  I  Pseudo-bank  2 


Figure  2.2.  Cray-2  Memory  Organization  (Conley,  July  1988) 
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Stride 


MFLOPS 


1 

83.33 

2 

31.25 

3 

83.02 

4 

19.23 

5 

82.54 

32 

4.43 

64 

1.50 

128 

0.87 

256 

0.50 

Table  2.1.  Performance  Figures  for  Different  Strides  (Kamrath, 1988:4) 


after  the  first  access.  If  different  pseudo- banks  in  the  same  bank  are  accessed  in  succession, 
the  processor  idles  25  clock  periods  between  access.  It  should  now  be  evident  that  data 
structure  design  and  control  structure  design  have  a  profound  impact  on  performance.  For 
maximum  performance,  the  user  must  be  careful  to  ensure  that  his  vector  operands  occupy 
contiguous  elements  in  memory,  with  the  optimum  stride  of  one.  Strides  that  are  non-zero 
powers  of  two  must  be  avoided,  and  strides  of  256  are  the  worst. 


2.2.3  The  Advantages  of  Vectorization  The  primary  method  that  the  Cray-2  uses 
to  increase  its  speed  over  conventional  computers  is  its  special  design  which  has  been 
optimized  to  perform  vector  operations.  In  High-Performance  Computer  Architecture, 
Stone  lists  four  advantages  of  vector  computers: 

1.  “Provide  vector  instructions  to  take  advantage  of  [vectorization]  for  numerical  appli¬ 
cations. 

2.  Provide  facilities  to  extend  the  range  of  applicability  of  the  architecture  beyond  vector 
processing. 

3.  Use  multiple  levels  of  memory,  particularly  high-speed  buffers;  and 

4.  Mix  pipeline  and  parallel  techniques  in  various  degrees  to  achieve  an  acceptable  value 
to  price  performance  (Stone,  1987:271).” 

2.2.4  The  Advantages  of  Pipelining  Vector  operations  are  implemented  using  vec¬ 
tor  pipelines,  enabling  vector  computers  to  produce  one  arithmetic  computation  (an  add. 


multiply,  or  divide)  at  a  maximum  of  one  result  every  clock  cycle.  A  pipeline  is  a  piece  of 
computer  hardware  which  breaks  an  operation  into  a  number  of  tasks,  each  of  these  tasks 
executing  simultaneously  with  other  tasks.  Stone  defines  a  pipeline  as  “a  structure  that 
consists  of  a  sequence  of  stages  through  which  a  computation  flows  with  the  property  that 
new  operations  can  be  initiated  at  the  start  of  the  pipeline  while  other  operations  are  in 
progress  through  the  pipeline  (Stone, 1987:406).”  Pipelines  enhance  computer  performance 
through  temporal  (time)  parallelism,  while  parallel  processors  achieve  their  improvements 
through  spatial  (space)  parallelism.  Both  are  forms  of  parallelism,  and  the  Cray-2  uses 
both  to  achieve  performance  improvements. 


A  pipeline  may  contain  any  number  of  stages,  each  stage  a  subset  of  the  particular 
operation  to  be  performed.  A  pipeline  is  analogous  to  an  assembly  line,  and  for  computer 
purposes,  it  is  basically  an  arithmetic  operation  assembly  line.  For  example,  a  floating 
point  multiply  pipeline  might  contain  three  stages,  each  performing  a  part  of  the  floating 
point  multiply  operation  in  sequence: 


1.  Add  the  exponents, 

2.  Multiply  the  mantissas,  and 

3.  Normalize  the  result. 


This  pipeline  would  produce  a  maximum  speedup  of  three  over  a  non-pipelined  approach 
to  floating  point  multiply.  Generally,  for  a  vector  pipeline  with  K  stages,  it  takes  K  clock 
periods  for  the  first  vector  element  to  emerge  from  the  pipeline,  and  each  additional  vector 
element  takes  one  clock  period.  However,  for  a  scalar  operation,  if  the  first  element  of 
a  vector  takes  N  clock  periods  to  complete,  then  each  additionsd  element  takes  N  clock 
periods  to  complete.  If  the  scalar  time  to  operate  on  the  first  vector  element  is  the  same 
as  the  vector  time  to  operate  on  the  first  vector  element  {K  =  N),  then  theoretically,  a 
pipeline  with  K  stages  can  produce  a  speedup  of  K  times  over  a  non-pipelined  operation, 
assuming  the  filling  and  emptying  of  the  pipeline  is  negligible.  Pipelining  also  provides 
for  the  efficient  use  of  memory  bandwidth  (Stone,  1987:274),  absolutely  necessary  for 
supercomputers  with  large  memory  capacity. 
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An  algorithm  which  operates  efficiently  in  scalar  may  not  be  the  best  candidate  for 
vectorization.  This  is  due  to  the  overhead  required  to  set  up  the  vector  pipeline.  In  general, 
the  breakeven  point  varies  depending  on  the  operation,  according  to  the  equation: 

T  =  V[t)  +  L-1  +  0  (2.1) 

where  T  is  the  total  time  to  implement  a  vector  operation,  V(t)  is  the  time  to  implement 
the  first  vector  operation,  L  is  the  vector  length,  and  O  is  the  additional  overhead  required 
to  set  up  the  vector  pipeline  over  the  scalar  operation.  The  breakeven  point  is  achieved 
when: 

L  X  5(0  >  T  (2.2) 

where  5(f)  is  the  time  to  implement  one  scalar  operation.  Substituting  the  expression  for 
T  given  by  Equation  2.1  into  Equation  2.2  results  in: 

I  X  5(f)  >  V(f)  + I  -  1 +0  (2.3) 

If  the  overhead  is  negligible  (0  clocks)  and  V(t)  =  S(t),  then  it  is  advantageous  to  vectorize 
when  the  vector  length  is  one  or  more.  (On  the  Cray-2,  V(t)  is  always  slightly  greater  than 
S(t)).  However,  if  the  overhead  is  equal  to  the  time  to  execute  one  operation,  then  the 
vector  length  must  be  greater  than  two  before  the  breakeven  point  is  achieved.  Thus,  one 
must  not  strive  to  vectorize  everything,  but  only  those  areas  where  speedup  is  achieved 
over  the  scalar  operation.  An  important  caveat  must  be  observed.  Only  hardware  execu¬ 
tion  speeds  have  been  considered  here,  but  software  execution  is  an  equal  if  not  a  more 
important  consideration.  The  time  to  set  up  a  loop  that  is  executed  in  scalar  mode  may 
be  more  time  consuming  than  the  vector  implementation  of  the  loop,  even  though  the 
vector  length  is  smaller  than  the  required  breakeven  vector  length,  because  the  software 
loop  overhead  may  cause  the  vector  operation  to  execute  more  efficiently  (Conley,  June 
1988).  Conversely,  the  vector  implementation  of  a  program  may  be  less  efficient  than  the 
scalar  version  because  of  the  software  overhead  associated  with  manipulating  data  to  take 
full  advamtage  of  vector  operations. 

Unfortunately,  pipelining  is  an  expensive  method  to  produce  results  when  the  pipeline 
must  be  prematurely  emptied  for  any  number  of  reasons.  Branching  instructions  and 
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pipeline  interlock  (caused  by  recurrence  relations)  can  significantly  degrade  the  perfor¬ 
mance  of  a  pipeline  (Stone,  1987:148).  With  the  overhead  to  set  up  a  pipeline,  it  is 
possible  that  a  pipeline  can  cause  the  computer  to  perform  more  poorly  than  if  there  had 
been  no  pipeline.  Theie  are  methods  to  minimize  these  problems,  such  as  branch  predic¬ 
tion  techniques,  but  the  most  efficient  method  for  solving  the  problems  posed  by  pipelines 
is  to  structure  the  software  in  a  manner  that  branches  and  interlocking  are  minimized. 

Another  requirement  is  to  minimize  the  use  of  operations  which  are  time-consuming. 
On  the  Cray-2,  divi'^ion  is  more  expensive  than  multiplication,  and  division  must  be  avoided 
whenever  possible.  The  Cray-2  has  no  hardware  divider,  so  it  must  compute  the  reciprocal 
of  the  divisor  and  multiply  by  the  reciprocal.  However,  forming  this  reciprocal  is  a  time 
consuming  process,  and  if  the  same  divisor  is  used  repeatedly  in  a  number  of  statements, 
it  is  a  good  programming  practice  to  place  the  reciprocal  of  the  divisor  in  a  temporary 
variable  and  multiply  by  the  temporary  variable. 

2.2.5  Vectorizing  Compilers  Vectorizing  compilers  automatically  structure  soft¬ 
ware  to  take  advantage  of  pipelined  and  vector  architectures.  While  vectorizing  compilers 
can  decrease  the  inefficient  use  of  the  pipeline,  the  software  must  still  be  structured  in  a 
manner  that  minimizes  pipeline  conflicts.  Hierarchical  data  structures  that  make  extensive 
use  of  pointers,  and  the  use  of  recurrence  relations  and  recursive  routines  can  be  partic¬ 
ularly  hazardous  to  efficient  pipeline  operation.  (Dyer  and  Whitmem,  1987:38),  ^Higbie, 
1983:180),  and  (Buzbee,  1986:191).  Thus,  any  software  that  attempts  to  take  advantage 
of  vector  operations  should  minimize  their  use.  A  number  of  vectorizing  compilers  are 
available  for  the  Cray-2:  Pascal,  FORTRAN,  and  C. 

2.2.6  Graphics  Supercomputers  A  promising  development  in  computer  graphics  is 
the  realization  of  the  advantages  that  vector  supercomputers  provide  for  computer  graphics 
users.  Two  computer  manufacturers  have  relesised  their  first  version  of  graphics  minisuper¬ 
computers.  Stellar  has  unveiled  its  GS 1000  graphics  minisupercomputer,  and  a  competitor, 
Ardent,  released  its  Titan  minisupercomputer.  Both  were  introduced  in  March  1988.  Also 
included  is  a  discussion  of  a  superworkstation,  the  Silicon  Graphics  IRIS  GTX.  While  these 
minisupercomputers  and  superworkstations  cannot  match  the  Crrv-2’s  performance  in  raw 


computing  power,  their  reported  performance  for  rendering  graphics  images  is  aistounding. 
Keep  in  mind,  however,  that  the  graphics  minisupercomputers’  performance  is  achieved 
by  implementing  the  most  time  consuming  operations  in  specialized  graphics  hardware, 
while  the  Cray-2  has  no  such  specialized  hardware  and  all  graphics  functions  must  be 
implemented  in  software. 

2.2.6. 1  Stellar  GSlOOO  Computer  Stellar’s  machine  is  a  20  to  25  Million  In¬ 
struction  per  Second  (MIPS)  machine  with  four  vector  pipelines,  capable  of  a  peak  perfor¬ 
mance  of  40  MFLOPS.  Its  memory  capacity  is  128  MB,  and  it  can  display  150,000  Gouraud 
shaded  z-buffered  triangles  per  second  (Apgar, 1988:255).  Its  speed  at  rendering  triangles  is 
derived  from  the  combination  of  a  special  purpose  graphics  engine  which  determines  pixel 
color  values  and  a  general  purpose  vector  floating  point  processor  pipeline  for  performing 
coordinate  transformation,  clipping,  matrix  operations,  and  other  functions  which  map 
easily  to  a  pipeline  architecture  (Apgar, 1988:258). 

2. 2. 6. 2  Ardent  Titan  Ardent  released  a  computer  similar  in  design  to  the 
Stellar  GSlOOO.  Its  clock  operates  at  62.5  ns  and  it  may  contain  up  to  four  32  bit,  16  MIP 
R2000  processors  and  128  MB  of  memory  (Diede  and  others,  1988:14).  Its  peak  perfor¬ 
mance  is  64  MFLOPS.  Its  architecture  features  three  functional  units  for  floating  point 
arithmetic,  multiplication,  and  division.  Like  the  Stellar  GSlOOO,  it  uses  vector  pipeline 
units  for  object  coordinate  tranforraation,  perspective  division,  clipping,  and  lighting  cal¬ 
culations,  and  it  contains  pixel  and  polygon  processors  responsible  for  determining  pixel 
color  values  (Diede  and  others,  1988:27).  Capable  of  displaying  200,000  Gouraud  shaded 
polygons  per  second,  it  supports  animation  rates  of  10,000  Gouraud  shaded  triangles  at 
15  frames  per  second  (Ardent,  1988). 

2. 2. 6. 3  Silicon  Graphics  IRIS  GTX  Silicon  Graphics  recently  released  their 
IRIS  GTX  “super workstation.”  While  Silicon  Graphics  calls  it  a  superworkstation  and 
not  a  minisupercomputer,  its  performance  nearly  matches  that  of  the  Stellar  and  Ardent 
machines,  and  it  boasts  many  of  the  same  architectural  features.  Its  CPU  includes  two 
16.7  MHZ  R2000  processors  with  two  floating  point  coprocessors,  and  up  to  128  MB  of 


memory.  Capable  of  20  MIPS,  the  graphics  system  can  render  100,000  z-buffered,  Gouraud 
shaded,  10  x  10  pixel  quadrilaterals  per  second  (or  120,000  triangles  per  second)  (Akeley, 
1988:239).  The  graphics  system  is  composed  of  four  subsystems: 

1.  Geometry  subsystem:  responsible  for  transforming  world  to  eye  coordinates,  clip¬ 
ping,  and  other  coordinate  transformation  functions.  This  subsystem  features  five 
geometry  engines  pipelined  together,  each  capable  of  20  MFLOPS  (100  MFLOPS 
total).  Each  geometry  engine  is  a  six  stage  pipeline  (Akeley,  1988:241,243). 

2.  Scan-conversion  subsystem:  responsible  for  edge  and  scan-segment  interpolation. 
Edge  and  scan-segment  interpolation  are  described  in  Chapter  3. 

3.  Raster  subsystem:  maintains  the  1280  x  1024  pixel  frame  buffer  (Akeley,  1988:241). 

4.  Display  subsystem:  supplies  video  information  to  the  display  (Akeley,  1988:241). 

All  three  computers  share  a  similar  partitioning  of  the  hidden-surface  display  prob¬ 
lem.  Vector  pipelines  are  used  for  coordinate  transformations,  perspective  division,  lighting 
calculations,  and  clipping,  while  edge  and  scan-segment  interpolation  and  display  are  han¬ 
dled  by  special  purpose  graphics  hardware. 

2.3  The  Visualization  Effort 

Supercomputers  produce  incredible  amounts  of  data  describing  a  variety  of  natural 
and  man-made  processes.  The  data  is  frequently  produced  in  the  form  of  numbers  and 
text,  and  transforming  this  data  into  a  visual  format  enhances  the  efficiency  with  which 
humans  can  understand  and  interpret  the  data.  However,  many  supercomputers  do  not 
have  a  graphics  environment  to  ease  the  manipulation  and  display  of  the  data.  Typically, 
this  data  requires  a  great  deal  of  memory,  and  downloading  it  to  graphics  workstations 
over  low  bandwldths  is  time-consuming  and  expensive.  Much  of  this  data  is  collected  from 
sensors  monitoring  dynamic  processes.  For  example,  the  sensors  monitoring  the  change 
in  airflow  over  an  aircraft  wing  can  quickly  produce  numerical  data  beyond  the  ability  to 
comprehend  it.  Certainly  the  understanding  of  this  airflow  data  is  enhanced  if  transformed 
and  displayed  in  real-time.  Visualizing  this  data  in  real-time  can  provide  scientists  with 
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additional  insight  that  would  not  be  possible  if  they  had  to  rely  solely  on  the  numerical 
data  (Special  Report,  1987:65). 

2.4  Real  Time  Computer  Graphics 

Visucdizing  data  in  reaJ-time  is  a  difficult  task.  Many  computer  generated  images  are 
static  images.  These  images  depict  motionless  objects,  and  for  a  very  good  reason  -  motion 
display  is  an  expensive  process.  Yet,  for  the  full  impact  of  the  visualization  of  computer 
data  to  occur,  these  images  must  show  motion,  for  the  impact  of  motion  on  an  audience  is 
almost  always  more  profound  than  static  displays,  and  there  are  numerous  problems,  such 
as  fluid  flow  analysis,  which  are  best  visualized  in  real-time. 

Real-time  motion  is  produced  by  displaying  images  in  a  sequence  approximately  30 
to  60  times  per  second  (Staudhammer,  1986:296).  Near  real-time  motion  is  produced  when 
the  images  are  displayed  at  rates  around  10  times  per  second.  For  the  best  effect,  a  display 
rate  of  60  times  per  second  is  desirable  as  an  annoying  and  distracting  flicker  occurs  when 
images  are  displayed  at  less  than  this  rate. 

As  technology  progresses,  and  as  processors  become  faister,  this  real-time  update  rate 
becomes  more  feasible  for  the  same  size  raster  display.  However,  technological  innovation 
is  also  producing  less  costly,  higher  resolution  raster  displays,  complicating  the  ability  to 
display  images  in  real-time.  For  a  raster  display  with  a  500  x  500  pixel  resolution,  a  30 
hertz  update  rate  corresponds  to  133.2  ns  update  time  per  pixel.  A  1000  x  1000  pixel  CRT 
resolution  requires  a  33.3  ns  access  time  per  pixel  for  the  same  display  rate.  Thus,  although 
the  perimeter  of  the  display  is  doubling,  the  update  rate  required  quadruples,  leading 
to  an  inverse  relationship  between  display  refresh  rate  capability  and  display  resolution. 
Either  the  viewer  has  to  be  content  with  lower  resolution  or  a  slower  refresh.  The  Ultra 
frame  buffer,  which  displays  images  generated  by  the  Cray-2  supercomputer,  supports  a 
1280  X  1024  pixel  resolution.  The  Ultra  is  double  buffered,  supporting  two  2048  x  1024 
pixel  buffers  which  allow  one  screen  to  be  displayed  while  the  other  is  loaded.  It  can 
display  a  1280  x  1024  image  at  14.7  frames  per  second,  about  half  the  real-time  rate,  when 
configured  with  buffer  switch  at  vertical  retrace.  The  Ultra  frame  buffer  with  display  is 


Figure  2.3.  Ultra  Frame  Buffer  with  Display 


shown  in  Figure  2.3.  The  Cray-2  and  Ultra  frame  buffer  block  diagram  are  shown  in  Figure 
2.4. 

In  order  to  achieve  real-time  performance,  the  graphics  software  must  produce  the 
color  value  for  all  pixels  on  the  display  in  one-thirtieth  of  a  second,  and  consequently,  the 
algorithm  chosen  to  display  the  objects  must  be  fast.  Most  graphics  software  is  computa¬ 
tionally  intensive  and  for  real-time  display  requires  computational  power  greater  than  that 
available  from  most  computer  graphics  workstations.  (Keep  in  mind  that  we  are  discussing 
graphics  software,  not  graphics  algorithms  embedded  in  hardware).  Indeed,  Staudhammer 
implies  that,  at  least  for  the  present,  “the  computations  required  for  detauled  image  pre¬ 
sentation  seem  to  mandate  supercomputer  usage.  Currently-used  algorithms  are  poorly 
suited  to  supercomputer’s  vectorization  requirements.  A  critical  view  needs  to  be  taken  at 
re-casting  the  display  calculations  into  forms  more  adaptable  to  vector  machines  (Staud¬ 
hammer,  1986:297).”  Dyer  and  Whitman  designed  vectorized  z-buffer  software,  but  its 
performance  is  not  real-time  (Dyer  and  Whitman,  1987). 


HSX  Channel 


Figure  2.4.  Diagram  of  Cray-2  with  Ultra  Frame  Buffer 


A  powerful  computer  is  not  all  that  is  required.  Once  the  computational  needs  are 
met,  the  bottleneck  may  shift  to  the  input/output  capability  of  the  display  device.  For 
real-time  graphics,  the  display  must  be  connected  to  the  computer  generating  the  pixel 
values  by  a  high  bandwidth  connection  capable  of  roughly  30  hertz  x  1280  X  1024  pixels 
X  24  bits/pixel  =  944  Mbits/sec,  an  incredibly  high  bandwidth.  Such  a  bandwidth  is  not 
easily  achieved,  but  the  bandwidth  requirement  can  be  reduced  using  data  compaction 
and  other  techniques.  Of  course,  data  compaction  requires  additional,  scarce  time.  The 
HSX  Controller,  providing  an  853  megabit  per  second  bandwidth  to  the  Ultra  frame  buffer, 
supports  a  maximum  Ultra  frame  buffer  display  rate  of  20.34  frames  per  second,  assuming 
32  bits  per  pixel  must  be  transmitted.  Of  course,  this  display  rate  exceeds  the  ability  of 
the  Ultra  when  buffer  switch  occurs  at  vertical  retrace,  and  thus  the  display  and  not  the 
HSX  connection  is  the  bottleneck. 

There  are  several  ways  to  display  objects  in  real-time.  One  of  these  ways  is  to  render 
each  frame  of  a  scene  individually  and  display  each  frame  in  non-real-time.  Each  frame  is 
stored  in  memory,  displayed,  photographed,  and  then  a  motion  picture  of  the  display  can 
be  replayed  in  real-time  by  a  movie  projector  or  videotape.  This  is  how  the  motion  in  some 
science  fiction  movies  is  synthesized,  but  it  is  extremely  time  consuming  and  in  many  cases 
is  not  feasible  for  the  viewing  of  scientific  data.  Another  method  involves  storing  a  sequence 
of  frames  in  the  computer’s  memory  and  displaying  the  frames  as  fast  as  the  display  can 
be  driven.  The  drawback  to  this  approach  is  the  large  amount  of  memory  required  to  hold 
the  frames  prior  to  display  and  the  requirement  for  a  high  bandwidth  connection  from  the 
computer  memory  to  the  display.  The  most  flexible,  yet  difficult  approach  is  to  render  and 
display  the  image  in  real-time. 

By  now,  it  is  evident  that  real-time  image  generation  is  a  particularly  time  consuming 
process.  However,  to  gain  the  advantages  of  real-time  visualization,  the  need  for  quick  dis¬ 
play  must  balance  with  the  need  for  an  accurate  image.  Staudhammer  noted  that  a  simple 
light  reflection  model  and  crude  surface  approximations  are  usually  adequate  for  rendering 
objects  in  real-time  (Staudhammer,  1986:296).  Some  of  these  surface  approximation  algo¬ 
rithms  can  be  classified  as  hidden-surface  removal  algorithms.  These  hidden-surface  algo¬ 
rithms  can  be  combined  with  a  simple  lighting  model  and  if  the  algorithm  produces  results 
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sufficiently  fast,  real-time  performance  is  achieved.  In  real-time  display,  the  most  com¬ 
monly  used  lighting  reflection  model  is  Gouraud  shading  (Bishop  and  Weimer,  1986:103). 
While  flat  shading  is  less  computationally  intensive  than  Gouraud  shading  (and  quicker), 
the  extra  realism  that  Gouraud  shading  provides  is  usually  worthwhile. 

2.5  Hidden-Surface  Removal  Algorithms 

There  are  numerous  algorithms  used  to  display  graphic.^  data.  They  vary  in  complex¬ 
ity  and  speed.  Any  algorithm  used  to  display  real-time  data  should  be  relatively  simple 
and  have  a  hidden-surface  removal  capability.  The  goal  of  hidden-surface  removal  is  to 
determine  which  surfaces  of  a  given  object  are  not  visible  from  a  particular  viewpoint  and 
to  display  only  those  surfaces  which  are  visible.  Hidden-surface  removal  is  a  fundamental 
requirement  for  most  images. 


There  are  several  methods  to  achieve  hidden-surface  removal.  In  “A  Characteri¬ 
zation  of  Ten  Hidden-Surface  Algorithms”  (Sutherland  and  others,  1974),  most  of  the 
hidden-surface  removal  algorithms  mentioned  fall  into  one  of  two  categories;  object-space 
algorithms  and  image-space  algorithms. 


2.5.1  Object-Space  Algorithms  Object-space  algorithms  determine  which  surface 
is  closest  to  the  viewer  by  comparing  each  surface  to  all  other  remaining  surfaces  and 
eliminating  surfaces  or  portions  of  surfaces  which  are  invisible.  The  computational  effort 
required  is  proportional  to  the  square  of  the  number  of  surfaces  and  these  algorithms 
are  generally  more  time  consuming  than  image-space  algorithms  (Foley  and  Van  Dam, 
1982:553),  and  thus  they  are  not  further  considered  for  vectorization. 


2.5.2  Image-Space  Algorithms  Image-space  algorithms  determine  which  surface  is 
visible  at  any  particular  resolution  point  (pixel)  on  the  display.  All  the  surfaces  which 
compose  a  scene  are  examined  at  each  resolution  point  to  determine  which  surface  is  the 
closest  to  the  viewer.  The  computational  effort  required  is  proportional  to  the  product 
of  the  number  of  resolution  points  (usually  pixels)  and  the  number  of  surfaces  (Foley  and 
Van  Dam,  1982:553).  There  are  a  number  of  image-space  algorithms  which  are  used  to 
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render  images,  including  depth-sorting  and  scan-line  algorithms.  Depth-sorting  and  scan¬ 
line  algorithms  have  been  rejected  as  candidates  for  vectorization  because  of  the  sorting 
requirements  for  these  algorithms.  As  the  number  of  objects  in  a  scene  increases,  significant 
performance  degradation  occurs  because  of  the  non-linear  nature  of  sorting.  At  best,  any 
sort  is  O(nlogn),  and  many  are  O(n^).  Two  algorithms  which  hold  the  greatest  potential 
for  real-time  performance  are  Binary  Space  Partitioning  and  the  z-buffer  approach  and  are 
discussed  in  the  following  sections. 

2.5.3  Binary  Space  Partitioning  (BSP)  BSP  algorithms  have  been  used  successfully 
for  displaying  data  in  near  real-time.  BSP  algorithms  require  the  preprocessing  of  object 
descriptions  in  object-space  and  display  of  the  objects  in  image-space.  Such  algorithms, 
while  good  for  static  environments  in  which  only  the  user’s  viewpoint  changes,  do  not 
perform  especially  well  when  the  environment  contains  a  large  number  of  moving  objects 
(Sutherland  and  others,  1974:24).  BSP  algorithms  are  usually  recursive,  which  make  them 
unsuitable  candidates  for  vectorization.  BSP  algorithms  can  be  restructured  to  make  them 
iterative,  but  the  BSP  algorithm  is  naturally  expressed  recursively,  and  restructuring  it 
iteratively  generally  makes  it  less  readable.  An  undesirable  feature  of  the  BSP  algorithm  is 
the  non-linear  increase  in  processing  time  eis  the  number  of  objects  in  the  scene  increases. 

2.5.4  Brute-force  Method  Sutherland  discusses  a  brute-force  method  to  solving 
the  hidden-surface  problem  (Sutherland,  1974:51).  The  brute-force  method  later  became 
known  as  the  z-buffer  algorithm.  At  the  time  it  was  introduced,  its  use  was  limited  due 
to  high  memory  requirements.  However,  with  the  availability  of  cheap  and  highly  dense 
memory  chips,  it  has  become  a  popular  approach.  Its  advantages  include  simplicity  and 
relatively  constant  performance,  regardless  of  the  number  of  polygons  in  a  scene,  because 
as  the  number  of  polygons  increases  the  number  of  pixels  covered  by  each  polygon  tends 
to  decrease  (Foley,  1982:561).  At  worst,  its  performance  grows  linearly  with  the  number 
of  polygons.  Since  visualization  data  typically  involves  the  display  of  complex  objects, 
an  algorithm  which  performs  well  for  a  large  number  of  objects  is  preferred.  Among  its 
disadvantages  is  susceptibility  to  aliasing  because  it  processes  polygons  in  an  arbitrsiry  or¬ 
der.  Sutherland,  in  “A  Characterization  of  Ten  Hidden-Surface  Algorithms,”  presents  an 


interesting  comparison  of  hidden-surface  algorithm  performance  in  Table  VI  (Sutherland, 
1974:54).  In  it,  he  shows  the  constant  performance  of  the  z-buffer  approach  as  the  scene 
complexity  increases.  All  of  the  other  algorithms  compared  against  the  z-buffer  increase 
non-linearly  with  scene  complexity,  and  for  complex  scenes,  the  z-buffer  approach  is  usu¬ 
ally  most  efficient.  For  relatively  non-complex  scenes,  the  other  image-space  algorithms 
perform  better  than  the  z-buffer  approach,  but  as  the  number  of  objects  in  a  scene  in¬ 
creases,  the  z-buffer’s  superior  performance  is  quite  pronounced.  Since  the  demand  for 
higher  quality  images  is  increasing,  not  decreasing,  an  algorithm  which  performs  linearly 
with  scene  comple.xity  is  essential.  Thus,  it  is  understandable  why  companies  that  special¬ 
ize  in  computer  graphics  equipment,  like  Silicon  Graphics,  Stellar,  and  Ardent  are  using 
the  z-buffer  approach  in  their  latest  state  of  the  art  computers  to  attempt  real-time  image 
generation. 

2.6  Conclusion 

It  appears  that  the  most  promising  way  to  achieve  real-time  display  of  images  is 
through  speedy  display  architectures,  connected  by  high  bandwidths  to  display  devices, 
coupled  with  a  highly  efficient,  vectorized,  hidden-surface  removal  algorithm.  The  following 
need  to  be  considered  when  designing  real-time  hidden-surface  removal  algorithms: 

1.  Use  a  simple  lighting  model. 

2.  Do  not  attempt  antialiasing  or  other  expensive  ways  to  enhance  the  image. 

3.  Connect  the  display  to  the  rendering  computer  by  a  high  bandwidth  connection. 

4.  Exploit  the  architecture  of  the  vector  supercomputer  to  fullest  extent. 

5.  Use  a  relatively  simple  display  algorithm  that  is  not  recursive  and  minimizes  the  use 
of  hierarchical  data  structures. 

From  initial  research  into  hidden-surface  algorithms,  the  z-buffer  approach  seems  the 
best  candidate  for  achieving  real-time  performance  by  vectorization. 
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III.  System  Analysis  and  Design 


3.1  Introduction 


This  chapter  covers  the  analysis  and  design  of  the  system  software  which  meet  the 
research  requirements  as  outlined  in  Chapter  1,  approach  and  methodology  section. 

3.2  System  Analysis 

System  analysis  is  the  process  of  identifying  what  the  system  needs  to  do  and  Chapter 
1  and  Chapter  2  discussed  the  research  into  the  development  of  a  system  for  visually 
displaying  data  in  real-time.  The  algorithm  chosen  for  investigation  and  implementation, 
based  on  preliminary  efficiency  considerations  and  potential  for  vectorization,  is  the  z-buffer 
depth  algorithm.  The  z-buffer  algorithm  is  a  well  known  and  widely  used  hidden-surface 
removal  algorithm.  It  requires  a  large  amount  of  memory,  which  limited  its  use  in  earlier 
generation  computers.  Memory  consumption,  however,  is  not  an  important  consideration 
for  the  Cray-2,  which  supports  two  gigabytes  of  main  memory. 

The  z-buffer  derives  its  name  from  its  requirement  for  a  z-buffer  (also  known  as  a 
depth  buffer)  in  which  the  closest  polygon  depth  at  each  resolution  point  is  stored.  It 
also  requires  a  refresh  buffer  which  stores  a  red- green- blue  (RGB)  color  triple  for  each 
point.  The  depth  buffer  is  initialized  to  a  background  depth  value,  and  the  refresh  buffer 
is  initialized  to  a  background  color.  Then  polygons  are  processed  in  an  arbitrary  order  and 
are  scan- converted  into  the  refresh  buffer.  Scan-conversion  is  the  process  of  determining 
right  and  left  polygon  edge  values  for  each  paurticular  scan-line  which  intersects  a  given 
polygon.  (Scan-lines  are  analogous  to  the  horizontal  lines  on  a  raster  display).  The  right 
and  left  edge  values  define  a  scan-segment.  For  all  the  resolution  points  which  a  given 
scan-segment  covers,  depth  and  color  values  are  computed.  For  each  point  inside  the  scan- 
segment,  the  depth  value  is  compared  to  the  current  z-buffer  depth  value.  If  the  depth 
of  the  currently  examined  point  is  closer  to  the  viewer  than  the  z-buffer  depth,  then  the 
z-buffer  depth  at  that  point  is  updated  with  the  currently  examined  resolution  point  depth, 
and  the  refresh  buffer  is  updated  with  the  RGB  color  of  the  pixel.  This  process  continues 
until  all  the  polygons  are  scan-converted.  Usually,  the  resolution  points  are  mapped  one- 
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to-one  onto  the  raster  display  pixels,  but  if  the  image  is  anti-aliased  then  several  resolution 
points  may  map  to  one  pixel. 

Foley  and  Van  Dam  summarize  the  z-buffer  algorithm  with  the  following  process 
description  (Foley, 1982:560):  “during  scan-conversion,  the  following  steps  are  performed 
for  each  point  (x,y)  inside  a  polygon: 

1.  Calculate  the  polygon  depth  z(x,y)  at  (x,y); 

2.  If  the  depth,  z(x,y),  is  less  than  the  z-buffer  depth  at  (x,y),  then: 

•  Place  z(x,y)  into  the  z-buffer  at  (x,y)  and 

•  Place  the  pixel  value  of  the  polygon  at  z(x,y)  into  the  refresh  buffer  at  (x,y).” 

Note  that  the  above  algorithm  may  be  implemented  in  a  number  of  different  ways 
and  its  scalar  implementation  may  differ  from  its  vector  implementation.  Mr.  Bob  Conley, 
a  computer  scientist  at  the  Air  Force  Weapons  Laboratory  (AFWL),  has  already  imple¬ 
mented  a  Pascal  z-buffer  algorithm  that  operates  in  scalar  mode  on  the  Cray-2.  This  thesis 
centers  on  restructuring  and  vectorizing  this  code. 

3.3  System  Design 

System  design  identifies  how  the  system  fulfills  the  requirements.  The  scalar  z-buffer 
software  is  designed  in  a  top-down  structured  manner,  and  top-down  design  techniques  are 
used  for  implementing  any  modifications  to  the  scalar  code.  Using  the  Pascal  programming 
language  eases  the  implementation  of  a  top-down  structured  design,  due  to  Pascal’s  strong¬ 
typing,  abundance  of  facilities  for  specifying  user  defined  types,  and  its  numerous  control 
constructs.  The  Cray-2  Pascal  compiler  is  an  extension  to  the  International  Standards 
Organization  (ISO)  Pascal.  Since  ISO  Pascal  does  not  support  a  syntax  which  is  easily 
mapped  to  the  Cray’s  vector  units,  Cray-2  Pascal  supplies  extensions  to  ISO  Pascal  which 
allow  the  computer  scientist  to  express  vector  syntax  naturally. 

The  system  design  can  be  broken  into  two  parts;  the  program’s  control  flow  and 
the  required  data  structures.  The  following  three  sections  describe  the  control  flow  for 


the  scalar  z-buffer,  and  the  alterations  to  the  scalar  z-buffer  to  make  the  code  suitable 
for  vectorization.  Then,  the  interpolation  methods  and  data  structures  necessary  for  the 
vector  version  are  defined. 

3.4  Control-Flow 

3.4- 1  Scalar  Z-buffer  Design  Language  Before  vectorizing  the  scalar  z-bufTer,  it  is 
necessary  to  understand  how  the  sczJar  version  of  the  z-buffer  program  works.  The  high- 
level  design  language  for  the  scalar  z-buffer  with  flat  shaded  polygons  works  as  follows: 

1.  Read  the  object  and  polygon  descriptions. 

2.  Transform  polygon  local  coordinates  to  world  coordinates. 

3.  For  all  frames  to  be  generated  do  #4  -  #7  below. 

4.  Transform  the  input  polygons  to  the  desired  view  by  rotating,  scaling,  or  translating 
the  polygons  and  autoscale,  which  centers  the  object  on  the  screen. 

5.  Transform  all  polygon  world  coordinates  to  eye  coordinates. 

6.  For  each  polygon  do; 

(a)  Clip  the  polygon  to  the  screen. 

(b)  Purge  if  a  backfacing  polygon,  perform  flat  shading  calculations  and  perform 
perspective  division. 

(c)  Each  polygon  is  composed  of  a  number  of  edges.  These  edges  must  be  broken 
into  a  number  of  smedl  horizontal  segments,  each  segment  corresponding  to  a 
pixel  on  the  display.  Determine  the  right  and  left  edge  pairs  for  each  scan-line 
which  intersects  the  polygon  in  screen-  space,  and  calculate  the  depth  values  for 
all  right  and  left  edge  pairs.  This  is  done  using  a  multiaxis  interpolation  routine 
based  upon  the  interpolation  method  presented  in  “A  Fast  Shaded- Polygon 
Renderer”  by  Roger  Swanson  and  Larry  Thayer  (Swanson  and  Thayer,1986). 
This  step  is  edge  interpolation  and  produces  scan-segments. 


(d)  For  each  scan-segment,  determine  the  depth  value  of  all  pixels  composing  the 
scan-segment.  This  is  scan-segment  interpolation.  If  the  depth  value  for  a  pixel 
is  less  than  the  z-buffer  depth  value,  deposit  the  pixel  color  into  the  refresh 
buffer  and  update  the  z-buffer  depth  value  with  the  new  depth  value. 

7.  Display  the  contents  of  the  refresh  buffer  on  the  Ultra  display. 

Reading  the  object  and  polygon  descriptions  requires  reading  from  a  disk  file.  This 
is  a  particularly  time  consuming  portion  of  the  code,  and  since  it  is  only  performed  once 
regardless  of  the  number  of  frames  generated,  it  is  not  included  in  timing  studies.  The  file 
format  description  is  included  with  the  source  code  and  is  available  upon  request. 

The  scalar  program  also  includes  a  bicubic  display  algorithm.  Once  the  bicubic  sur¬ 
faces  are  broken  into  triangles,  the  above  z-buffer  algorithm  is  used  to  render  the  triangles. 
The  decomposition  of  the  bicubic  surfaces  into  triangles  has  not  been  vectorized.  One  of 
the  factors  which  makes  vectorization  of  the  decomposition  difficult  is  the  recursive  nature 
of  the  decomposition,  since  recursion  defeats  vectorization. 

3.4.2  Vector  Z-buffer  Design  Language  Many  of  the  ideas  for  vectorizing  the  z- 
buffer  algorithm  came  from  two  excellent  articles:  “A  Vectorized  Scan-line  Z-Buffer  Ren¬ 
dering  Algorithm,”  (Dyer  and  Whitman,  1987)  and  “Fast  Scan-line  Conversion  Using 
Vectorisation,”  (Goldapp,  1986).  These  ideas,  as  well  as  ideas  from  Mr.  Bob  Conley 
(AFWL/SCPS),  are  incorporated  into  the  algorithm  for  the  vectorized  z-buffer. 

Any  z-buffer  algorithm  performs  two  basic  functions  which  can  be  structured  for  vec¬ 
torization.  These  two  functions  involve  polygon  (or  trapezoid)  edge  interpolation  (which 
produces  scan-segments)  and  scan-segment  interpolation,  which  produces  pixel  color  val¬ 
ues.  Figures  3.1  and  3.2  show  a  polygon  decomposed  into  scan-segments  and  a  polygon 
decomposed  into  trapezoids. 

Another  important  calculation  determines  the  shading  of  each  polygon  or  trapezoid. 
There  are  three  primary  methods  of  shading:  Flat,  Gouraud,  and  Phong: 

1.  Flat  shading.  This  is  the  only  shading  method  vectorized,  although  the  scalar  version 

of  the  program  implements  all  three  shading  methods.  Flat  shading  requires  the 


Figure  3.1.  Polygon  and  Polygon  Decomposed  into  Scan-segments 


3-5 


least  number  of  computations.  For  a  trapezoid  with  four  vertices,  eleven  pieces  of 
information  are  required.  All  four  vertices  require  x  and  z  screen  space  values,  and 
two  vertices  require  y  screen  space  values.  Only  two  y  screen  space  values  are  required 
because  a  trapezoid’s  top  and  bottom  extents  are  both  horizontal.  Each  trapezoid 
2l1so  requires  a  color  value. 

2.  Gouraud  shading.  Gouraud  shading  requires  twenty-two  pieces  of  information.  Each 
of  four  trapezoid  vertices  requires  x  and  z  screen  space  values,  and  two  vertices  require 
y  screen  space  values.  In  addition,  all  four  vertices  require  an  RGB  triple  (there  are 
three  pieces  of  information  for  each  RGB  triple,  one  for  each  color  component). 

3.  Phong  shading.  Phong  shading  requires  thirty-four  pieces  of  information.  In  addition 
to  the  information  required  for  Gouraud  shading,  Phong  shading  requires  twelve 
pieces  of  information.  A  three  space  normal,  containing  x,  y,  and  z  values  in  world 
space,  at  each  of  the  four  trapezoid  vertices  must  be  maintained. 

Transforming  object  coordinates  from  one  coordinate  system  to  another  (for  example, 
world  to  eye  coordinate  conversion),  and  polygon  rotations,  scaling,  and  translations  are 
adso  likely  to  benefit  from  vectorization. 

There  are  several  different  ways  to  vectorize  the  z-buffer  algorithm,  and  it  is  difficult 
to  forecast  which  is  likely  to  be  the  most  efficient,  and  it  is  more  likely  that  each  method’s 
results  vary  depending  on  the  objects  rendered.  Four  of  the  various  options  are: 

•  Method  1:  Edge  Sc  Scan-segment  Interpolation  with  height  and  length 
buckets 

1.  Vector  edge  interpolation  with  height  buckets 

2.  Vector  scan-segment  interpolation  with  length  buckets 

•  Method  2:  Edge  Sc  Scan-segment  Interpolation  with  height  buckets 

1.  Vector  edge  interpolation  with  height  buckets 

2.  Vector  scan-segment  interpolation  without  length  buckets 


3-6 


•  Method  3:  Edge  Scan-segment  Interpolation  with  length  buckets 

1.  Vector  edge  interpolation  without  height  buckets 

2.  Vector  scan-segment  interpolation  with  length  buckets 

•  Method  4:  Edge  and  Scan-segment  interpolation  without  buckets 

1.  Vector  edge  interpolation  without  height  buckets 

2.  Vector  scan-segment  interpolation  without  length  buckets 

The  following  describes  the  four  methods  in  detail.  Note  that  steps  annotated  with 
a  are  implemented  in  vector  form,  and  all  of  the  below  steps  can  be  implemented  in 
parallel. 

3. 4.2.1  Method  1:  Edge  Scan-segment  Interpolation  with  height 
and  length  buckets 

1.  Read  the  object  and  polygon  descriptions. 

2.  Transform  polygon  local  coordinates  to  world  coordinates. 

3.  For  all  frames  to  be  generated  do  #4  -  #9 

4.  *  Transform  the  resulting  input  polygons  to  the  proper  view  by  rotating,  scaling,  or 
translating,  and  autoscale. 

5.  *  Transform  all  polygon  world  coordinates  to  eye  coordinates. 

6.  For  each  polygon  do: 

(a)  Clip  the  polygon  to  the  screen. 

(b)  Purge  if  a  backfacing  polygon,  perform  flat  shading  calculations  and  perform 
perspective  division. 

(c)  Decompose  the  polygon  into  trapezoids.  Care  must  be  exercised  when  decom¬ 
posing  polygons  which  have  colinear  edges.  Trapezoidal  decomposition  ensures 
that  each  edge  is  matched  by  only  one  other  edge  and  reduces  the  requirement 


for  maintaining  an  active  edge  list  and  the  associated  sorting.  Michael  Goldapp 
states  that  “decomposition  into  trapezoids  allows  vector  operations  both  hori¬ 
zontally  and  vertically,"  and  this  decomposition  method  is  “nearly  always  best 
(Goldapp,  1986:141, 147).”  The  Silicon  Graphics  GTX  has  hardware  support  for 
this  decomposition.  In  an  article  written  about  the  GTX,  the  authors  observed 
that  “it  will  almost  always  be  desirable  to  reduce  the  full  polygons  to  collec¬ 
tions  of  screen- aligned  trapezoids  whose  parallel  edges  are  in  the  direction  of 
preferred  fill  in  the  frame  buffer.  (Akeley,  1988-241).”  Silicon  Graphics  decom¬ 
poses  polygons  into  trapezoids  with  two  parallel  vertical  edges,  while  Method 
1  decomposes  the  polygons  into  trapezoids  with  two  horizontal  edges.  Nothing 
favors  one  approach  over  the  other  (the  required  calculations  are  the  same)  with 
the  exception  that  contiguous  pixels  in  a  scan-segment  occupy  contiguous  ele¬ 
ments  in  memory.  Of  course,  contiguous  elements  in  memory  correspond  to  a 
Cray-2  memory  stride  of  one.  Note  that  the  trapezoid  decomposition  proceeds 
in  screen  space,  and  each  time  the  polygons  are  transformed  for  viewing  the 
trapezoids  must  be  recomputed. 


(d)  For  each  trapezoid  do: 


i.  *  Determine  the  height  of  each  trapezoid  (corresponding  to  the  number  of 
scan-lines  which  intersect  a  given  trapezoid)  and  place  each  trapezoid  into 
the  height  bucket.  There  is  one  height  bucket  for  each  possible  trapezoid 
height  value.  Since  there  are  1024  lines  on  the  Ultra  display,  the  possible 
height  values  range  from  zero  to  1023,  so  there  are  1024  height  buckets.  Note 

t 

that  these  buckets  consume  a  large  aimount  of  memory.  Each  trapezoid 
consumes  at  most  34  words  of  memory  (34  words  for  Phong  shading,  22 
words  for  Gouraud  shaxling,  11  words  for  Flat  shading),  and  each  height 
bucket  contains  space  for  at  most  64  trapezoids,  so  each  height  bucket 
requires  2176  words  of  memory.  All  of  the  height  buckets  together  require 
2,228,224  words  of  memory,  which  is  equivalent  to  17,825,792  bytes  of  Cray- 
2  memory.  The  vector  length  of  the  Cray-2  is  64,  so  it  is  more  advantageous 
to  operate  on  vectors  with  multiples  of  length  64,  thus  each  height  bucket 
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has  space  for  64  trapezoids. 

ii.  If  the  current  bucket  contains  64  trapezoids,  empty  the  height  bucket: 

A.  *  Determine  the  right  and  left  edge  position  values  for  all  the  trapezoids 
in  the  height  bucket.  (Edge  values  for  color  are  also  required  if  Gouraud 
shading  is  required;  color  plus  surface  normal  are  needed  for  Phong 
shading).  Now  each  trapezoid  in  the  bucket  has  been  decomposed  into 
scan-segments. 

B.  Place  the  resulting  scan-segments  into  the  appropriate  length  buckets. 
As  with  the  height  buckets,  the  length  buckets  consume  a  tremendous 
amount  of  memory.  Each  scan-segment  consumes  at  most  16  words  of 
memory  (16  words  for  Phong  shading,  10  words  for  Gouraud  shading, 
and  5  words  for  flat  shading).  Notice  that  the  memory  requirement 
is  smaller  than  for  the  trapezoid  buckets.  Each  scan-segment  bucket 
contains  space  for  at  most  64  scan-segments,  and  there  must  be  one 
scan-segment  bucket  for  each  possible  scan-segment  length.  Since  there 
are  1280  pixels  of  horizontal  resolution  on  the  Ultra  display,  the  pos¬ 
sible  length  values  range  from  zero  to  1279.  All  of  the  scan-segment 
buckets  consume  1,310,720  words  of  memory  if  Phong  shading  is  used, 
or  10,485,760  bytes.  By  now,  it  should  be  evident  that  most  computers 
do  not  contain  enough  maun  memory  for  such  buckets.  Even  the  Cray- 
1,  with  2MW  of  memory  would  not  contain  enough  memory  for  the 
height  and  length  buckets.  For  computers  which  are  limited  in  memory, 
dynamic  memory  allocation  and  disposal  may  lessen  memory  require¬ 
ments  somewhat.  However,  dynamic  memory  allocation  may  negatively 
impact  vectorization  due  to  the  extensive  use  of  pointers  and  lack  of 
contiguous  memory  storage  and  the  problem  of  unpredictable  stride. 
Fortunately,  the  Cray-2  supercomputer  provides  the  luxury  of  generous 
memory  usage. 


C.  If  the  current  scan-segment  bucket  contains  64  scan-segments,  then 
perform  z-buffering  for  all  the  scan-segments  in  the  bucket  by:  1)  * 
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Calculating  the  z-depth  and  shading  the  scan-segment.  For  Gouraud 
shading,  color  must  be  interpolated  across  the  scan-segment  prior  to 
shading  calculations,  while  for  Phong  shading,  surface  normals  must  be 
interpolated  across  the  scan-segment  prior  to  shading.  2)  Writing  the 
scan-segment  values  to  the  refresh/frame- buffer. 

7.  Empty  all  height  buckets  which  are  not  empty. 

8.  Empty  all  length  buckets  which  are  not  empty. 

9.  Display  the  contents  of  the  frame-buffer  on  the  Ultra  display. 

3. 4-2. 2  Method  2:  Edge  &  Scan-segment  Interpolation  with  height 

buckets 

1.  Read  the  object  and  polygon  descriptions. 

2.  Transform  polygon  local  coordinates  to  world  coordinates. 

3.  For  all  frames  to  be  generated  do  #4  -  #8 

4.  *  Transform  the  resulting  input  polygons  to  the  proper  view  by  rotating,  scaling,  or 
translating,  and  autoscale. 

5.  *  Transform  all  polygon  world  coordinates  to  eye  coordinates. 

6.  For  each  polygon  do: 

(a)  Clip  the  polygon  to  the  screen. 

(b)  Purge  if  a  backfacing  polygon,  perform  flat  shading  calculations  and  perform 
perspective  division. 

(c)  Decompose  the  polygon  into  trapezoids. 

(d)  For  each  trapezoid  do; 

i.  *  Determine  the  height  of  each  trapezoid  (corresponding  to  the  number  of 
scan-lines  which  intersect  a  given  trapezoid)  and  place  each  trapezoid  into 
the  height  bucket. 

ii.  If  the  current  height  bucket  contains  64  trapezoids,  empty  the  height  bucket: 


A. 


B. 


Determine  the  right  and  left  edge  position  values  for  all  trapezoids  in 
the  height  bucket.  This  decomposes  each  trapezoid  into  a  number  of 
scan-segments.  The  numerical  value  of  height  is  the  number  of  scan- 
segments  that  must  be  interpolated  for  that  trapezoid. 

For  all  of  the  64  scan-segments  perform  z-buiTering:  1)  *  Interpolate 
each  scan-segment  in  vector  mode,  producing  pixel  depth  and  color 
values  for  the  scan-segment.  2)  For  all  pixels  contained  in  the  scan- 
segment,  if  the  current  pixel  depth  is  less  than  the  z-buffer  depth,  write 
the  pixel  color  value  to  the  refresh  buffer. 


7.  Empty  all  height  buckets  which  are  not  empty. 

8.  Display  the  contents  of  the  frame-buffer  on  the  Ultra  display. 

3. 4-2. 3  Method  3;  Edge  Scan-segment  Interpolation  with  length 

buckets 


1.  Read  the  object  and  polygon  descriptions. 

2.  Transform  polygon  local  coordinates  to  world  coordinates. 

3.  For  all  frames  to  be  generated  do  #4  -  #7 

4.  *  Transform  the  resulting  input  polygons  to  the  proper  view  by  rotating,  scaling,  or 
translating,  and  autoscale. 

5.  *  Transform  the  polygon  world  coordinates  to  eye  coordinates. 

6.  For  each  polygon  do: 


(a)  Clip  the  polygon  to  the  screen. 

(b)  Purge  if  a  backfacing  polygon,  perform  flat  shading  calculations  and  perform 
perspective  division. 

(c)  Decompose  the  polygon  into  trapezoids. 

(d)  For  each  trapezoid  do: 
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i.  *  Individually  interpolate  each  edge  in  vector,  producing  scan-segment  val¬ 
ues  for  the  right  and  left  trapezoid  edges, 
ii.  Place  each  resulting  scan-segment  into  the  appropriate  length  buckets. 

A.  If  the  current  scan-segment  bucket  contains  64  scan-segments,  then  for 
all  the  scan-segments  in  the  bucket  perform  z-buffering:  1)  *  Calculate 
the  z-depth  and  shade  the  scan-segment.  2)  Write  the  scan-segment 
values  to  the  refresh/frame-bulfer. 

7.  Empty  all  length  buckets  which  are  not  empty. 

8.  Display  the  contents  of  the  frame-buffer  on  the  Ultra  display. 

3.4‘2-4  Method  4:  Edge  &  Scan-segment  Interpolation  without  buck¬ 


ets 


1.  Read  the  object  and  polygon  descriptions. 

2.  Transform  polygon  local  coordinates  to  world  coordinates. 

3.  For  all  frames  to  be  generated  do  #4  -  #7 

4.  *  Transform  the  resulting  input  polygons  to  the  proper  view  by  rotating,  scaling,  or 
translating,  and  autoscale. 

5.  *  Transform  the  polygon  world  coordinates  to  eye  coordinates. 

6.  For  each  polygon  do; 


(a)  Clip  the  polygon  to  the  screen. 

(b)  Purge  if  a  backfacing  polygon,  perform  flat  shading  calculations  and  perform 
perspective  division. 


(c) 

(d) 


Decompose  the  polygon  into  trapezoids. 

For  each  trapezoid  do: 

i.  *  Individually  interpolate  each  edge  in  vector,  producing  scan-segment  X'al- 
ues  for  the  right  and  left  trapezoid  edges. 
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ii.  *  Interpolate  individually  each  scan-segment  in  vector  mode,  producing 
pixel  depth  and  color  values  for  the  scan-segment. 

iii.  For  all  pixels  contained  in  the  scan-segment,  if  the  current  pixel  depth  is 
less  than  the  z-buffer  depth  value,  write  the  resulting  pixel  color  values  to 
the  refresh  buffer. 

7.  Display  the  contents  of  the  frame-buffer  on  the  Ultra  display. 

5.5  Edge  and  Scan-segment  Interpolation  Methods 

Edge  and  scan-segment  interpolation  are  fundamentad  operations  in  this  algorithm 
and  they  warrant  further  discussion  which  more  rigorously  defines  the  mathematical  basis 
for  the  Pascal  implementation.  Linear  interpolation  is  the  process  of  determining  the 
location  of  a  point  on  a  line  given  the  two  end  points,  and  for  edge  and  scan-segment 
interpolation,  the  desired  result  is  a  sequence  of  equally  distant  points  along  the  edge 
or  scan-segment.  The  polygon  or  trapezoid  edges  are  generated  by  linear  interpolation 
of  planar  polygons  edges  defined  in  screen  space,  and  the  scan-segments  are  generated 
from  the  end  points  of  these  edges,  thus  edge  and  scan-segment  generation  are  linear 
interpolations. 

In  general,  any  sequence  of  equally  distant  points  along  a  line  segment  defined  by 
the  end  points  and  (lend, end)  can  be  determined  by  the  Equation  3.1: 


d  start  +  {dend  —  dgtart) 


Winter  *»forf 
^end  Iftort 


(3.1) 


Where  dnew  is  the  value  of  the  dependent  variable  at  iinter',  and  interpolation  proceeds 
from  igtart  to  tend-  Three  different  methods  used  for  edge  and  scan-segment  interpolation, 
based  upon  Equation  3.1,  are  listed  below. 


1.  Scalar  interpolation. 

2.  Bucketized  interpolation. 

3.  Non-bucketized  interpolation. 


In  this  chapter,  and  in  Chapter  4,  Pascal  vector  syntax  is  used  to  express  some  of  the  thesis 
concepts  and  results.  If  Pascal  syntajc  is  unfamiliar,  refer  to  the  short  overview  provided 
in  Appendix  A. 


3.5.1  Scalar  interpolation  The  scalar  method  interpolates  a  line  segment,  defined 
by  the  end  points  {istart,dstart)  and  (iend,dend),  by  repeated  addition,  using  the  recurrence 
relation  given  in  Equation  3.2. 


Ad  *- 

^end  dgtart 

^end  “  ^siari 

do  «— 

d>gtaTt 

dn  - 

dn-\  +  Ad 

for  1  <  n  <  {iond  -  i,tart) 

(3.2) 

In  Equation  3.2,  d  is  the  dependent  variable,  i  is  the  independent  variable,  and  the 
distance  i^nd  —  Utart  is  divided  into  tend  —  i$taTt  intervals  of  equal  length.  Now,  applying 
Equation  3.2  to  edge  interpolation,  each  edge  is  defined  by  the  triples  (j/top,itop,^top)  and 
{y bottom,  X bottom,  ^bottom),  where  X  is  the  horizontal  position  in  three-space,  y  is  the  vertical 
position  in  three-space,  and  z  is  the  depth  in  three-space.  The  i  and  y  values  are  rounded  to 
the  nearest  integers,  since  they  correspond  to  pixels  on  the  display.  For  edge  interpolation, 
the  independent  variable  t  is  the  vertical  position  y,  and  the  dependent  variable  d  is  either 
X  or  z.  Thus: 


ytop 

*»tort 

- 

ybottom 

dtnd 

- 

(®top>  ^top) 

dtiari 

(,Xbottom,  ^bottom) 

For  edge  interpolation  of  the  x  values.  Equation  3.2  expressed  in  Pascal  syntax  is: 

height  :■  ytop  -  yhot;  {  Polygon  height  (y  extent  of  the  edge)> 

deltax  :■  (xtop  -  xbot) /height; 

X  :*  xbot; 

for  i  :•  1  to  height  do  {  Compute  the  x  value  where  each  } 

X  :*  X  ■4'  deltax;  {  scan-line  intersects  the  edge.  } 
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Each  scan-segment  end  point  is  defined  by  iyieft,xieft,  zu/t)  and  {yright,Xright,  Zright)- 
The  X  and  y  values  are  integers.  For  scan-segment  interpolation,  the  independent  variable 
i  is  the  horizontid  position  x,  the  dependent  variable  d  is  z,  and  y  does  not  change  across 
a  scan-segment.  Thus: 


^end 

^start 

^end 

dstart 


bright 

^Uft 

^righi 

^UJt 


Note  that  for  Equation  3.2,  Ad  cannot  be  calculated  for  line  segments  of  zero  length 
(*end  ~  =  0),  but  segments  of  zero  length  do  not  require  interpolation.  Scan-segments 

of  zero  length  occur  in  graphics  work,  and  usually  correspond  to  one  pixel  or  less  of  display 
coverage. 


S.5.2  Bucketized  interpolation  When  all  trapezoid  edges  have  the  same  non-zero 
height,  or  when  all  scan-segments  have  the  same  non-zero  length,  an  entire  bucket  of  size 
m  can  be  interpolated  by  repeated  addition  using  Equation  3.3,  which  is  really  a  form  of 
Equation  3.2. 


Ad[l..rn\ 


do[l..m] 


(f[l..Tn]  dgtar([l..Tn] 
*end[l]  ~  t*tor«[l] 

ditarAl-m] 


d„[l..m] 


<f„_i[l..m]  -I-  Ad[l..Tn] 


for  l<n<(w[l]- wt[l])  (3.3) 


For  edge  interpolation,  the  independent  variable  i  is  the  vertical  position  y,  and  the 
dependent  variable  d  is  either  x  or  z,  and  The  height  of  the  trapezoid  is  equivalent  to 


*  yfrottomlE.m] 
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For  scan-segment  interpolation,  th"  independent  variable  i  is  the  horizontal  position 
X,  the  dependent  variable  d  is  the  depthh  z,  the  length  of  the  scan-segment  is  equivalent 
to  iend[l]  —  t#«orf[l]»  and  the  number  of  scan-segments  in  the  bucket  is  m.  Thus: 


dend[l-m] 

[l  ••m] 


[l 

X/e/t[l..m] 

[f 

zuft[\..Tn] 


For  edge  interpolation  of  the  z  values,  the  mathematical  formula  in  Equation  3.3  can 
be  expressed  in  Pascal  syntax  as: 


deltazCl . .m]  :=  (ztopCl-.m]  -  zbot [1 . .m] ) /height ; 
z[l..in]  :=  zbot[l..m]; 
for  i  :=  1  to  height  do 

z[l..in]  ;=  zCl..m]  +  deltazCl .  .m]  ; 


The  Cray-2  can  handle  buckets  of  an  arbitrary  length,  but  optimum  bucket  size  is  a 
multiple  of  64. 

3.5.3  Non-bxicketized  interpolation  Non-bucketized  interpolation  of  edges  and  scan- 
segments  is  necessary  when  edges  or  scan-segments  are  not  collected  in  buckets  containing 
edges  of  identical  height,  or  scan-segments  of  identical  length,  and  when  vector  interpola¬ 
tion  is  desired.  Equation  3.4  is  a  modification  of  Equation  3.2  such  that  there  is  no  longer  a 
recurrence  relation.  Eliminating  the  recurrence  enables  the  operation  to  proceed  in  vector. 


Ad 


den(f  dgtorf 
tend  ~  t»tart 


do  *—  d,tart 


do 

0 

\ 

< —  do  Ad  X 

dn  , 

n 

for  n  —  lend  I'etart 


(3.4) 
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For  edge  interpolation,  the  independent  variable  t  is  the  vertical  position  y,  the 
dependent  variable  d  is  either  x  or  2,  and  the  height  of  the  trapezoid  is  equivalent  to 
lend  ifiart-  ThuSl 


^end 

^  Vtop 

^atart 

ybottom 

^end 

*  ^top) 

^atari 

For  scan-segment  interpolation,  the  independent  variable  i  is  the  horizontal  position 
X,  the  dependent  variable  d  is  computed  for  z,  and  the  length  of  the  scan-segment  is 
equivalent  to  iend  -  istart-  Thus; 


lend 

bright 

letart 

- 

Xleft 

^end 

- 

bright 

• 

^ttart 

Xleft 

For  scan-segment  interpolation  of  the  z  values,  the  mathematical  formula  in  Equa¬ 
tion  3.4  is  expressed  in  Pascal  vector  syntax  as; 


n  :=  zright  -  xleft;  {  Scan-segment  length  > 
deltaz  :*  (zright  -  zleft)/n;  {  Reciprocal  of  the  edge  slope  } 
2[0..n]  :*  zleft  ♦  deltaz  *  offset[0. .n] ; 


Offset  [0.  ,n]  is  an  array  filled  with  the  values  0  through  1279.  No  offset  greater 
than  1279  would  be  needed,  since  there  are  only  1280  pixels  per  horizontal  raster  display 
line. 

Caution  should  be  observed  when  vectorizing  edge  or  scan-segment  interpolation. 
Due  to  the  limited  floating  point  precision  of  any  computer,  the  non-bucketized  interpo¬ 
lation  method  may  produce  slightly  different  results  than  the  scalar  method.  This  was 
verified  when  non-bucketized  interpolation  was  used  in  the  vectorized  software.  Although 


3-17 


the  image  displayed  on  the  screen  showed  no  discernible  differences  between  the  two  meth¬ 
ods,  an  analysis  of  edge  and  scan-segment  histograms  produced  using  the  two  methods 
revealed  some  inconsistencies.  (An  edge  histogram  shows  the  number  of  trapezoids  com¬ 
posing  the  object  for  any  particular  height,  while  a  scan-segment  histogram  displays  the 
number  of  scan-segments  of  any  particular  length.  One  would  expect  the  histograms  pro¬ 
duced  using  the  scalar  method  to  match  those  produced  using  the  vector  method,  but  this 
was  not  the  case.  The  histogram  variations  were  small,  but  nonetheless  evident). 

3.6  Data  Structures 

Generally,  the  most  time-consuming  portion  of  a  z-buffer  is  the  edge  and  scan-segment 
interpolation.  Because  this  interpolation  is  performed  on  trapezoids,  the  structure  of  the 
trapezoid  data  structure  (Figure  3.3)  is  very  important.  According  to  Dyer  and  Whitman, 
who  implemented  a  vectorized  z-buffer  on  a  Convex  computer  using  the  “C”  programming 
language,  data  structures  which  are  used  to  store  data  resulting  from  vector  operations 
should  minimize  the  use  of  pointers  (Dyer  and  Whitman,  1987:38).  Since  the  use  of  pointers 
makes  the  stride  unpredictable,  Pascal  arrays  or  record  structures  should  be  used. 

The  trapezoid  bucket  data  structure  (Figure  3.4)  closely  resembles  the  trapezoid 
data  structure  so  all  of  the  trapezoid  information  can  be  placed  into  the  bucket  by  a  single 
Pascal  statement.  The  record  components  are  named  differently  to  minimize  ambiguity. 
The  scan-segment  bucket  structure  is  depicted  in  Figure  3.5. 

When  using  the  Gouraud  or  Phong  shading  methods,  the  data  structures  in  Figure 
3.3,  Figure  3.4,  and  Figure  3.5  require  expansion.  For  Gouraud  shading,  color  information 
for  all  four  trapezoid  vertices  must  be  maintained,  while  for  Phong  shading,  color  amd 
normal  information  for  all  four  trapezoid  vertices  must  be  maintained. 
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I 

i 

I 


type 

•  color  =  0. .4294967295; 

trapezoidtype  =  record 
case  t  :boolean  of 

:  real;  {top  left  x  value}- 
:  integer;  {top  y  value} 

{top  left  z  value} 

:  real;  {bottom  left  x  value} 

:  integer;  {bottom  y  value} 

{bottom  left  z  value} 
{top  right  X  value} 
{top  right  z  value} 
{bottom  right  x  value} 
:  real;  {bottom  right  z  value} 
polcol  :  color) ;  {color} 
false  : (ta: array [1 .. 11]  of  integer); 
end  {record}; 

traparr  =  array [0.. 8]  of  trapezoidtype; 
var 

trapezoid  :  traparr; 


true 


: (ulex 
uley 
ulez , 
Ilex 
Iley 
Ilez, 
urex, 
urez, 
Irex, 
Irez 


Figure  3.3.  Trapezoid 


Data  Structure 
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type 

color 


0.  .4294967295; 


mat64int  =  array [1. 

.64]  of  integer; 

mat64rea  =  array [1. 

.64]  of  real; 

mat64col  =  array [1. 

.64]  of  color; 

heightbucket  =  record 

case  t  :  boolean 

of 

true  : (xltop 

:  mat64rea; 

{top  left  X  value} 

ytop 

:  mat64int; 

{top  y  value} 

zltop , 

{top  left  z  value} 

xlbot 

:  mat64rea; 

{bottom  left  x  value} 

ybot 

:  inat64int; 

{bottom  y  value} 

zlbot , 

{bottom  left  z  value} 

xrtop. 

{top  right  X  value} 

zrtop , 

{top  right  z  value} 

xrbot , 

{bottom  right  x  value} 

zrbot 

:  mat64rea; 

{bottom  right  z  value} 

polcol 

:  inat64col) ; 

{trapezoid  color} 

false  :  (tb : array [1 . . 11 , 1 . .64]  of  integer); 
end  {record}; 

heightarr  «  array [0. . 1023]  of  heightbucket ; 
heightcnt  =  array [0 .. 1023]  of  integer; 


var 


hgtbucket  ;  heightarr; 
hgtcnt  :  heightcnt; 


Figure  3.4.  Trapezoid  Height  Bucket  Data  Structure 
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type 

color  «  0. .4294967295; 
mat64int  «  array[1..64]  of 
mat64rea  »  array [1.. 64]  of 
mat64col  =  array [1.. 64]  of 

lengthbucket  =  record 
lin, 

Ipix  :  mat64int; 

depl,  depr  :  mat64rea: 
polcol  :  niat64col; 

end  {record} ; 

lengtharr  *  array [0 .. 1279] 
lengthcnt  =  array  [0 .. 1279] 

var 

Igtbucket  :  lengtharr; 

Igtcnt  :  lengthcnt: 

Figure  3.5.  Scan-segment  Length  Bucket  Data  Structure 
3.1  Methodology 

There  are  three  different  ways  to  vectorize  the  above  design  language: 


integer: 
real ; 
color; 


{active  scemline} 

{left  pixel  value} 

{left  and  right  pixel  depths} 
{color  of  the  scan-segment} 

of  lengthbucket; 
of  integer: 


1.  Make  use  of  the  vectorizing  compiler  and  write  code  in  scalar  form. 

2.  Write  assembly  code  targeting  the  vector  units. 

3.  Use  the  Pascal  vector  syntax. 


With  the  first,  one  can  write  the  code  and  allow  the  Cray-2  vectorizing  compiler  to 
vectorize  the  code.  The  Cray-2  Pascal  vectorizing  compiler  allows  switches  to  be  set  which 
automatically  optimize  scalar  code,  and  attempts  to  vectorize  code  contained  inside  loop 
structures.  Procedure  calls  defeat  automatic  vectorization.  “If”  statements  do  not  neces¬ 
sarily  defeat  vectorization.  An  “if’  statement  can  be  vectorized  provided  the  conditional 
expression  is  a  vector  expression  and  provided  the  “then”  and  “else”  clauses  contain  only 
vector  assignments  (Conley,  Oct  1988). 
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The  one  advantage  of  method  one  is  that  an  extensive  understanding  of  the  Cray-2 
architecture  or  Pascal  vectorization  syntax  is  not  required.  Unlike  the  Cray-2  FORTRAN 
compiler,  however,  the  Pascal  compiler  does  not  indicate  when  a  particular  segment  of  code 
has  been  vectorized.  Thus,  the  Pascal  programmer  cannot  be  certain  that  a  particular 
segment  of  code  has  been  vectorized  unless  the  object  code  is  examined. 

The  second  method  is  to  write  assembly  code  for  explicit  vectorization  by  specifying 
in  Cray  Assembly  Language  (CAL)  the  loading  of  the  vector  registers  and  issuing  of  vector 
instructions.  This  method  can  produce  code  which  executes  faster  than  the  corresponding 
code  written  in  Pascal  vector  syntax,  but  its  disadvantages  include  readability  and  main¬ 
tainability  problems  that  assembly  language  programming  usually  causes.  Furthermore, 
it  would  be  difficult  to  determine  if  the  speedups  were  attributed  to  writing  in  assembly 
code  or  to  using  vectorization. 

A  reasonable  alternative  to  methods  one  and  two  is  to  use  the  Pascal  vector  syntax 
to  explicitly  indicate  where  vectorization  is  to  occur.  Thus,  the  programmer  is  neither 
dependent  on  the  automatic  vectorization  of  scalar  loop  structures,  which  may  or  may 
not  occur,  nor  is  the  programmer  required  to  learn  assembly  code  with  its  associated 
disadvantages.  The  z-buffer  code  was  vectorized  using  this  third  method. 

3.8  Summary 

The  Analysis  and  Design  lays  the  groundwork  for  a  systematic  implementation  and 
analysis  of  results.  Since  there  are  four  implementation  alternatives,  each  may  be  investi¬ 
gated,  and  the  most  efficient  alternative  chosen. 


IV.  Implementation  and  Results 


4-1  Introduction 

This  chapter  presents  the  thesis  implementation  and  results  in  three  sections.  The 
first  section,  timing  studies,  explains  the  timing  studies  performed  on  the  scalar  and  vector¬ 
ized  versions  of  the  z-buffer  code.  Heuristics  for  vectorization,  the  second  section,  focuses 
on  rules  of  thumb  for  vectorizing  scalar  code.  These  heuristics  were  discovered  while  vec¬ 
torizing  the  z-buffer  algorithm,  but  are  applicable  when  vectorizing  any  algorithm.  From 
experience  gained  by  vectorizing  the  z-buffer  code,  the  third  section  presents  a  different  ap¬ 
proach  which  should  improve  performance.  This  chapter  concludes  with  a  brief  summary 
of  the  results. 


4-2  Timing  Studies 

4.2. 1  Introduction  Timing  studies  are  essential  when  comparing  the  relative  perfor¬ 
mance  of  two  or  more  different  software  versions.  Timings  can  yield  valuable  information 
and  reveal  where  improvements  can  be  made.  However,  there  are  many  things  to  consider 
when  timing  software,  and  poorly  performed  timing  is  not  only  useless,  but  may  cause 
the  algorithm  designer  to  draw  incorrect  conclusions  about  algorithm  performance.  The 
following  sections  present  the  value  of  timing  studies  and  how  they  are  used  in  this  the¬ 
sis.  They  also  discuss  the  machine  dependency  of  timing  studies,  the  statistical  variation 
encountered  when  timing,  and  how  timing  is  performed.  Finally,  display  times  for  two 
different  objects  are  given. 


4.2.2  Timings  are  Hardware  Dependent  Timings  are  machine  dependent.  Conse¬ 
quently,  when  comparing  the  performance  of  a  vectorized  algorithm  to  its  scalar  coun¬ 
terpart,  it  is  difficult  to  make  generalizations  which  apply  to  a  broad  clziss  of  computer 
architectures.  Dyer  and  Whitman  implemented  their  z-buffer  algorithm  on  a  Convex  C-1 
supercomputer,  using  the  Convex  “C”  vectorizing  compiler,  and  they  report  a  three  to  one 
speedup  of  the  vectorized  version  over  the  scalar  version  (Dyer  and  Whitman,  1987:43). 
Depending  on  the  relative  efficiency  of  the  scalar  instructions  to  the  vector  instructions, 
such  speedup  may  not  be  difficult  to  achieve. 
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4-2.2. 1  Scalar  Optimization  When  comparing  the  execution  efficiency  of  a 
sequence  of  scalar  instructions  to  their  vector  equivalent,  it  is  necessary  to  understand 
how  execution  of  the  scalar  instructions  differs  from  vector  instructions.  The  Cray-2  has 
several  mechanisms  for  improving  the  efficiency  of  scalar  instruction  execution.  These 
include  a  scalar  optimizer  and  an  architecture  which  permits  several  instructions  to  be 
executed  simultaneously  within  the  same  processor.  The  Cray-2  computer  contains  nine 
independent  functional  units  (adders,  multipliers  etc.),  and  unlike  many  computers,  it  can 
issue  instructions  when  the  required  “functional  unit,  source  and  operand  registers,  and 
read/write  data  paths  needed  by  the  instruction  are  free  at  the  particular  clock  periods 
that  the  instruction  needs  them.  A  resource  does  not  necessarily  need  to  be  free  at  the  clock 
period  of  issue  because  the  execution  of  the  instruction  may  not  need  the  resource  until  a 
later  clock  period.  Issue  will  proceed  as  long  as  the  resource  is  available  at  the  needed  clock 
period  (Conley,  Nov  1988).  If  the  instructions  are  properly  ordered  to  eliminate  depen¬ 
dencies  (one  instruction  cannot  depend  on  the  results  obtained  in  another  simultaneously 
executed  instruction),  instructions  can  be  executed  in  several  different  functional  units  at 
once,  increasing  performance  significantly.  Depending  on  the  order  and  types  of  instruc¬ 
tions,  the  scalar  version  of  a  program  can  execute  faster  if  this  parallelism  is  exploited. 
Thus,  what  the  designer  may  consider  to  be  relatively  minor  changes  to  the  software,  like 
reordering  instructions,  can  have  a  profound  impact  on  the  performance  of  the  scalar  code. 

The  Cray  Pascal  scalar  optimizer  attempts  to  reorder  instructions,  eliminate  un¬ 
necessary  instructions,  expand  procedures  inline,  eliminate  loop  invariants  and  register 
transfers,  and  reduce  memory  use  (Cray  Pascal,  1986:13-12).  At  times,  its  precise  behav¬ 
ior  is  unpredictable,  and  minor  changes  to  the  scalar  code  may  have  a  significant  effect 
upon  performance.  This  can  frustrate  attempts  to  compare  the  relative  efficiency  of  one 
scalar  version  to  another,  or  the  efficiency  of  the  scalar  version  to  the  vector  version.  A 
pertinent  question  to  ask  is:  if  speedup  occurs,  is  the  speedup  due  to  vectorization,  or  is  it 
attributable  to  a  reordering  of  the  scalar  instructions  which  triggers  the  scalar  optimizer? 
Indeed,  the  designer  is  at  the  mercy  of  how  the  compiler  writer  chose  to  implement  the 
programming  language  features,  which  by  no  means  guarantees  the  implementation  is  an 
efficient  one. 
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4-2. 2. 2  Programming  Language  Used  to  Vectorize  Another  important  consid¬ 
eration  is  the  programming  language  used  to  vectorize  the  algorithm.  Performance  com¬ 
parisons  of  vector  to  scalar  code  across  programming  languages  is  complicated  by  compiler 
differences.  Studies  performed  by  the  Air  Force  Weapons  Laboratory  show  the  perfor¬ 
mance  of  the  relatively  mature  FORTRAN  compiler  to  be  significantly  better  than  the 
performance  of  the  Pascal  compiler  at  both  automatically  vectorizing  code  and  optimizing 
scalar  code.  The  Pascal  compiler  also  seems  to  generate  “seemingly  wasteful  instruction 
sequences  (Conley,  Aug  1988).” 

4-2.2. 3  Software  versus  Hardware  Algorithm  Implementation  As  mentioned 
in  Chapter  2,  the  z-buffer  algorithm  on  the  Cray-2  is  implemented  in  software.  Most  work¬ 
stations,  however,  have  hardware  z-buffer  support.  Algorithms  imbedded  in  hardware  can 
execute  orders  of  magnitude  faster  than  the  same  algorithms  implemented  in  software. 
Comparing  the  efficiency  of  the  Cray-2’s  architecture  design  to  the  that  of  the  worksta¬ 
tion’s  hardware  design  based  on  how  quickly  the  Cray-2  can  execute  software  instructions 
versus  how  quickly  the  workstations  can  execute  instructions  embedded  in  hardware  is 
unreasonable. 

4-2.3  Cray-2  Clock  Functions  The  Cray-2  operating  system  supports  a  number  of 
clock  functions  which  return  the  execution  time  for  a  section  of  code.  All  of  these  timing 
functions  exact  a  performance  penalty,  albeit  small.  Two  different  timing  functions,  the 
second  function  and  the  microsecond  function,  were  used  to  time  the  code. 

4-2.3. 1  Second  Function  The  second  function  uses  CPU  time  to  determine 
how  long  a  given  section  of  code  takes  to  execute.  The  user  is  not  affected  when  his 
job  loses  the  processor.  Accurate  time,  as  measured  by  the  second  function,  cannot  be 
guaranteed  for  sections  of  code  which  take  less  than  a  millisecond. 

4. 2. 3.2  Microsecond  Function  Unlike  the  second  function,  the  microsecond 
function  does  not  suffer  from  the  inability  to  measure  small  time  increments.  It  is  based 
on  the  hardware  real  time  clock,  and  is  accurate  to  within  one  Cray-2  clock  cycle,  4.1  ns. 
However,  it  continues  to  measure  time  when  the  job  loses  the  processor  (blocking),  and  is 
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very  sensitive  to  system  loading.  Because  the  Cray-2  is  heavily  used,  a  job  is  not  likely 
to  have  dedicated  use  of  the  processor  and  time  measured  by  the  microsecond  function 
will  vary  according  to  the  frequency  with  which  the  timed  section  of  code  is  blocked.  One 
way  to  circumvent  the  microsecond  function’s  limitations  is  to  have  dedicated  use  of  one 
of  the  Cray-2’s  processors,  but  this  can  be  guaranteed  only  by  prohibiting  all  others  from 
using  the  computer  -  a  rather  costly  solution.  Therefore,  the  microsecond  function  should 
be  reserved  for  those  cases  when  the  section  of  code  under  investigation  takes  less  than 
several  milliseconds,  and  the  lowest  time  measured  is  the  more  accurate  time.  Averages  of 
time  should  not  be  used,  unless  the  user  has  dedicated  use  of  the  Cray-2. 

4-2-4  Statistical  Variation  Random  events  can  complicate  timing  studies.  It  is 
possible  that  all  of  the  differences  in  timing  between  two  pieces  of  code  can  be  purely  due 
to  chance.  To  reduce  this  possibility,  in  most  instances,  five  timing  samples  were  observed 
and  a  mean  taken  for  the  five  samples.  In  all  cases,  the  deviation  of  the  samples  from  the 
mean  was  small.  Note  that  means  determined  from  microsecond  clock  timings  are  likely 
to  be  inaccurate. 

4-2.5  Timing  Caveats  Clearly,  timings  are  data  dependent.  Conclusions  drawn 
from  one  set  of  data  may  conflict  with  those  drawn  from  another,  so  it  is  usually  desirable 
to  perform  timings  on  different  data  sets.  Analysis  of  different  data  sets  should  suggest 
limitations  of  the  vector  software  and  may  indicate  which  type  of  data  is  best  operated 
upon  with  either  vector  or  scalar  software.  With  this  is  mind,  two  different  data  sets  were 
used.  One  of  the  data  sets  is  an  F-16  aircraft,  characterized  by  small  polygons  in  the 
normal  view  (the  view  used  for  testing).  The  other  data  set  describes  a  dodecahedron 
(a  polyhederon  with  twelve  faces)  and  an  icosahederon  (a  polyhederon  with  twenty  faces) 
combination,  characterized  by  large  polygons.  Of  course,  polygon  size  is  view  dependent, 
and  either  the  F-16  (Figure  4.2.5)  or  the  “Dodecalcosahederon”  (Figure  4.2.5)  viewing 
parameters  could  be  changed  resulting  in  different  polygon  statistics. 

The  F-16  data  was  generated  by  Major  Larry  Feldman  of  the  Air  Force  Weapons 
Laboratory,  and  the  “Dodecalcosahederon”  data  was  created  by  Mr.  Bob  Conley.  Relevamt 
statistics  for  the  F-16  and  the  “Dodecalcosahederon,”  important  because  they  indicate 
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•  Number  of  polygons  .  2997 

•  Number  of  pixels  .  193352 

•  Number  of  scan-segments  .  44588 

•  Number  of  scan-segments  rejected  .  248 

•  Mean  number  of  pixels  per  polygon  .  64.52 

•  Mean  scan-segment  length  (pixels)  .  4.36 

•  Mean  polygon  height  (pixels)  .  14.79 


Table  4.1.  F-16  Polygon  Statistics 


•  Number  of  trapezoids  .  7644 

•  Number  of  pixels  .  219625 

•  Number  of  duplicate  pixels  .  26273 

•  Number  of  scan-segments  .  49262 

•  Number  of  scan-segments  rejected  . 382 

•  Mean  number  of  pixels  per  trapezoid  . 28.73 

•  Mean  scan-segment  length  (pixels)  .  4.49 

•  Mean  trapezoid  height  (pixels)  .  6.39 

•  Mean  number  of  trapezoids  per  polygon . 2.55 


Table  4.2.  F-16  Trapezoid  Statistics 

vector  lengths  and  aid  in  timing  analyses,  are  shown  in  Tables  4.1  through  4.4. 

Statistics  reported  for  each  object  include  the  number  of  polygons  and  the  number 
of  trapezoids  that  compose  the  object.  Polygons  are  decomposed  into  trapezoids,  and  due 
to  the  nature  of  the  decomposition,  some  of  the  scan-segments  are  shared  across  trapezoid 
edge  boundaries.  The  “number  of  duplicate  pixels”  statistic  reports  the  extra  number 
of  pixels  which  must  be  computed  when  working  with  trapezoids.  Another  interesting 
statistic  is  the  number  of  scan-segments  rejected.  These  scan-segments  are  of  a  negative 
length  (the  right  edge  pixel  value  is  less  than  the  left  edge  pixel  value).  This  anomaly  is 
due  to  some  of  the  polygons  not  being  rejected  as  backfacing  polygons  because  of  limited 
floating  point  precision  in  the  purge  back-facing  polygon  procedure. 


•  Number  of  polygons  .  13 

•  Number  of  pixels  .  555505 

•  Number  of  scan-segments  .  4131 

•  Number  of  scan-segments  rejected  . 347 

•  Mean  number  of  pixels  per  polygon  .  42731.15 

•  Mean  scan-segment  length  (pixels)  .  146.80 

•  Mean  polygon  height  (pixels)  .  291.08 

Table  4.3.  “Dodecalcosahederon”  Polygon  Statistics 


•  Number  of  trapezoids  .  36 

•  Number  of  pixels  .  560944 

•  Number  of  duplicate  pixels  .  5439 

•  Number  of  scan-segments  .  3807 

•  Number  of  scan-segments  rejected  . 349 

•  Mean  number  of  pixels  per  trapezoid .  15581.78 

•  Mean  scan-segment  length  (pixels)  .  147.35 

•  Mean  trapezoid  height  (pixels)  .  105.75 

•  Mean  number  of  trapezoids  per  polygon  .  2.77 

Table  4.4.  “Dodecalcosahederon”  Trapezoid  Statistics 
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4-2.6  How  the  Code  was  Decomposed  for  Timing  For  timing  purposes,  the  vector 
and  scalar  versions  of  the  z-buffer  software  are  broken  into  six  areas: 

1.  Rotation,  scaling,  and  translation. 

(a)  This  section  requires  the  multiplication  of  two  four  by  four  floating  point  ma¬ 
trices. 

(b)  The  microsecond  clock  was  used  to  time  this  area  because  the  time  to  execute 
the  instructions  was  so  small. 

2.  World  to  eye  coordinate  conversion. 

(a)  This  section  requires  multiplying  all  the  three-dimensional  points  representing 
the  polygons  by  a  transformation  matrix. 

(b)  This  was  timed  using  the  second  clock. 

3.  Miscellaneous  overhead. 

(a)  This  includes  perspective  division,  lighting  calculations,  and  the  purging  of  back- 
facing  polygons. 

(b)  This  was  timed  using  the  second  clock. 

4.  Trapezoidal  Decomposition. 

(a)  Breaks  trapezoids  into  polygons. 

(b)  This  was  timed  using  the  second  clock. 

5.  Polygon/trapezoid  edge  interpolation.  Polygon/trapezoid  edge  interpolation  was 
timed  using  the  second  clock  and  can  be  performed  three  different  ways.  These 
are: 

(a)  Scalar  method. 

(b)  Height  bucket  method. 

(c)  Individual  edge  interpolation. 
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6.  Scan-segment  interpolation.  Scan-segment  interpolation  was  timed  using  the  second 
clock  and  can  be  performed  three  different  ways.  These  are; 

(a)  Scalar  method. 

(b)  Scan-segment  bucket  method. 

(c)  Individual  scan-segment  method. 


Each  area  Wcis  separately  timed  because  vectorization  may  cause  speedup  in  some 
areas  and  slowdown  in  others.  Speedup  is  measured  as:  T imcscaiar ITimCvectoT  and  speedup 
greater  than  one  indicates  that  vectorization  is  more  efficient.  Generally,  timings  were 
collected  by  enclosing  comments  around  all  portions  of  the  code  except  for  the  area  of 
interest.  Caution  must  be  exercised  here.  For  example,  if  the  overhead  for  a  procedure 
call  is  desired,  the  designer  may  be  tempted  to  time  the  execution  of  the  procedure  and 
then  time  the  execution  with  the  inner  body  of  the  procedure  commented  out.  This  may 
produce  misleading  results,  as  the  scalar  optimizer  may  detect  that  the  procedure  contains 
no  executable  instructions,  and  eliminate  the  entire  procedure  call. 

Another  rather  obvious  requirement  is  for  both  the  scalar  and  vector  software  to 
produce  the  same  results.  In  computer  graphics  work,  this  is  relatively  easy  to  verify,  since 
the  images  produced  by  each  software  version  can  be  compared.  In  other  work,  some  sort 
of  verification  must  be  used. 

Upon  analysis  of  the  four  vector  methods  presented  in  Chapter  3,  it  is  evident  that 
all  four  methods  do  not  have  to  be  individually  coded  and  tested  to  determine  which 
methods  are  more  efficient.  It  is  sufficient  to  break  the  significant  areas  of  interest  into 
code  sections  and  to  time  the  execution  of  each  section.  Then,  the  four  vector  methods  are 
reconstructed  by  adding  up  the  execution  times  for  each  of  the  sections.  In  this  manner, 
the  timing  analysis  performed  on  each  section  will  give  a  rough  idea  of  the  most  efficient 
method. 

The  four  vector  methods  differ  from  each  other  only  in  the  method  of  edge  and  scan- 
segment  interpolation.  An  indication  of  the  value  of  each  method  can  be  determined  by 
timing  edge  interpolation  with  and  without  height  buckets,  and  by  timing  scan-segment 
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interpolation  with  and  without  scan-segment  buckets.  The  execution  time  for  rotation, 
scaling,  and  translation,  world  to  eye  coordinate  conversion,  miscellaneous  overhead,  and 
trapezoidal  decomposition  for  all  four  vector  methods  remains  the  same. 


4-S.7  Timings  for  the  F-16  This  section  presents  the  timing  results  for  displaying 
the  F-16.  An  asterisk  indicates  that  the  code  was  vectorized. 


4. 2. 7.1  Scalar  Timing  -  no  clipping 


Rotation,  scaling,  and  translation  .  861.87  ps 

World  to  eye  coordinate  conversion  .  21.50  ms 

Total  of  above  and  overhead  .  30.00  ms 

Miscellaneous  display  overhead  .  201.00  ms 

Trapezoidal  Decomposition  . 0.00  ms 

Polygon  edge  interpolation  .  248.00  ms 

Scan-segment  interpolation  .  341.00  ms 

Total  display  time  .  790.00  ms 

Total .  820.00  ms 


4.2.7. 2  Vector  Timing  -  no  clipping 

*Rotation,  scaling,  and  translation  .  066.88  ps 

*  World  to  eye  coordinate  conversion  .  1.00  ms 

Total  of  above  and  overhead  .  11.50  ms 

Miscellaneous  display  overhead  .  204.00  ms 

Trapezoidal  Decomposition  .  145.00  ms 

’"Trapezoid  edge  interpolation  (buckets)  .  132.00  ms 

*Trapezoid  edge  interpolation  (no  buckets)  .  157.10  ms 

*Scan-segment  interpolation  (buckets)  .  483.00  ms 

*Scan-8egment  interpolation  (no  buckets)  .  507.80  ms 

Method  1  total  display  time  .  964  ms 

Method  2  total  display  time  .  989  ms 

Method  3  total  display  time  .  989  ms 

Method  4  total  display  time  .  1014  ms 
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Section 

Scalar 

Vector 

Speedup 

^Rotation  etc. 

861.87  fj.s 

066.80  fis 

«  1%) 

12.90 

*World  to  eye  conversion 

21.50  ms 

ss  1.00  ms 

(<  1%) 

21.50 

Above  and  overhead 

30.00  ms 

11.50  ms 

(1.2%) 

2.61 

Miscellaneous  overhead 

201.00  ms 

204.00  ms 

(21.2%) 

0.99 

Trapezoidal  Decomposition 

0.00  ms 

145.00  ms 

(15.0%) 

0.00 

*Edge  interpolation 

248.00  ms 

132.00  ms 

(13.7%) 

1.89 

*Scan-seg  interpolation 

341.00  ms 

483.00  ms 

(50.1%) 

0.71 

Totcil  display  time 

790.00  ms 

964.00  ms 

0.82 

Table  4.5.  Speedup  Table  for  F-16  Display 

4. 2. 7.3  Summary  of  F-16  Timings  Unpredictably,  the  scadar  version  of  the 
program  executed  faster  than  any  of  the  vectorized  versions.  This  result  is  particularly 
perplexing,  especially  since  vectorization  did  produce  some  localized  speedups.  The  scalar 
version  (without  clipping)  took  790  ms  to  display  the  F-16,  while  the  fastest  fully  vectorized 
version,  vector  method  one  took  964  ms,  for  a  speedup  of  0.820.  Because  speedup  was  not 
achieved,  it  was  necessary  to  determine  where  the  vectorized  code  performed  more  poorly 
than  the  scalar  code.  Table  4.5  lists  the  performance  and  speedup  of  the  scalar  code  over 
the  vector  code,  and  the  percentage  of  time  consumed  by  each  section  of  the  vectorized 
code. 


Table  4.5  indicates  that  there  are  at  least  two  aretis  which  require  further  attention: 
trapezoidal  decomposition  and  scan-segment  interpolation.  Trapezoidal  decomposition  is 
not  required  in  the  scalar  version  but  is  essential  for  interpolating  polygon  edges  in  vec¬ 
tor.  Notice  that  trapezoidal  decomposition  is  not  implemented  in  vector,  and  cannot  be 
effectively  vectorized  unless  all  polygons  consist  of  a  certain  number  of  edges.  Trapezoidal 
decomposition  could  be  vectorized  if  all  polygons  were  triangles,  requiring  all  input  poly¬ 
gons  to  be  decomposed  into  triangles  (Conley,  Jul  88).  Converting  the  polygons  to  triangles 
could  occur  in  world  space  once,  and  never  again.  Except  for  the  generation  of  the  first 
frame  of  an  animation,  polygon  to  triangle  conversion  would  not  be  performed.  This  time 
is  likely  to  be  a  small  percentage  of  the  total  display  time  and  should  not  appreciably  affect 
the  ability  to  achieve  real-time  performance.  Once  the  triangles  were  generated,  they  could 
be  decomposed  into  trapezoids  in  vector  mode. 
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The  other  area  which  must  be  considered  is  the  scan-segment  interpolation,  which 
consumes  around  50%  of  the  display  time.  At  first,  the  cause  of  this  performance  degra¬ 
dation  was  believed  to  be  the  duplication  of  scan-segments  along  trapezoid  edge  borders. 
Scan-segment  duplication  occurs  because  the  bottom  scan-segment  of  one  trapezoid  is 
identical  to  the  top  scan-segment  of  its  lower  neighboring  trapezoid.  After  analyzing  the 
F-16  statistics  provided  in  Tables  4.1  and  4.2,  the  number  of  duplicate  pixels  which  must 
be  interpolated  is  only  12%  of  the  total.  Since  the  amount  of  time  to  interpolate  a  scan- 
segment  is  proportional  to  the  number  of  pixels  covered,  it  is  reasonable  to  expect  only  a 
12%  increase  in  the  scan-segment  interpolation  time.  Furthermore,  the  average  polygon  is 
only  64.52  pixels,  and  the  ratio  of  duplicate  scan-segments  is  less  when  the  polygons  are 
larger,  since  the  number  of  duplicate  scan-segments  increases  with  the  perimeter  of  the 
polygon,  while  the  total  number  of  pixels  increases  with  the  area  of  the  polygon.  Due  to 
the  small  average  size  of  the  polygons,  a  12%  duplication  of  pixels  is  likely  to  be  a  large 
ratio.  Consider  the  “Dodecalcosahederon,”  with  an  average  polygon  size  of  42,731  pixels. 
The  number  of  pixels  occurring  within  duplicate  scan-segments  is  only  1.2%  of  the  total. 

Eliminating  the  duplicate  scan-segments  requires  modifying  the  trapezoidal  decom¬ 
position  routine,  which  would  certainly  result  in  longer  execution  time  for  this  routine. 
Thus,  the  savings  from  not  interpolating  duplicate  scan-segments  must  be  greater  than 
the  additional  time  spent  eliminating  these  segments.  With  such  a  low  percentage  of  time 
spent  interpolating  duplicate  scan-segments,  a  modification  of  the  trapezoid  generation 
code  was  deemed  inefficient. 

Unlike  the  scan-segment  interpolation,  edge  interpolation  benefited  from  vectoriza- 
tion.  Bucketized  edge  interpolation  took  132  ms,  and  achieved  a  speedup  of  almost  two 
times  over  the  scalar  version.  Non-bucketized  edge  interpolation,  which  took  157.10  ms, 
was  faster  than  the  scalar  edge  interpolation,  but  slower  than  the  bucketized  edge  inter¬ 
polation. 

Most  encouraging  were  the  speedups  for  rotation,  scaling,  and  translation  and  world- 
to-eye  coordinate  transformations:  12.90  and  21.50  respectively.  Both  of  these  areas  involve 
extensive  use  of  matrix  algebra,  which  can  be  vectorized  rather  handily.  However,  both 
consume  such  a  small  portion  of  overall  execution  time,  around  0.1%,  that  the  speedup  had 


a  negligible  effect  upon  the  whole  program.  It  should  be 
in  one  area  can  mask  speedup,  reinforcing  the  importance 
code. 


evident  by  now  that  slowdown 
of  timing  individual  sections  of 


4.2.8  Timings  for  the  “Dodecalcosahederon”  This  section  presents  the  timing  re¬ 
sults  for  displaying  the  “Dodecalcosahederon.” 


4.2.8. 1  Scalar  Timing  -  no  clipping 


Rotation,  scaling,  and  translation .  <  1.00  ms 

World  to  eye  coordinate  conversion  .  <  1.00  ms 

Total  of  above  and  overhead  .  <  1.00  ms 

Polygon  edge  interpolation  .  11.00  ms 

Scan-segment  interpolation  .  726.40  ms 

Total  display  time  .  738.40  ms 

Total .  739.40  ms 


4. 2. 8. 2  Vector  Timing  •  no  clipping 


*Rotation,  scaling,  and  translation  .  <  1.00  ms 

*  World  to  eye  coordinate  conversion  .  <  1.00  ms 

Total  of  above  and  overhead  .  <  1.00  ms 

Trapezoidal  decomposition  .  <  1.00  ms 

*Trapezoid  edge  interpolation  (buckets)  .  25.47  ms 

*Trapezoid  edge  interpolation  (no  buckets)  .  11,10  ms 

*Scan-segment  interpolation  (buckets)  .  1158.60  ms 

*Scan-segment  interpolation  (no  buckets)  .  1029.80  ms 

Method  1  total  display  time  .  1185  ms 

Method  2  total  display  time  .  1056  ms 

Method  3  total  display  time  .  1171  ms 

Method  4  total  display  time  .  1042  ms 
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Section 

Scalar 

Vector 

Speedup 

^Rotation  etc. 

<  1.00  ms 

<  1.00  ms 

«  1%) 

*World  to  eye  conversion 

<  1.00  ms 

<  1.00  ms 

(<  1%) 

Above  and  overhead 

<  1.00  ms 

<  1.00  ms 

(<  1%) 

Miscellaneous  overhead 

<  1.00  ms 

<  1.00  ms 

(<  1%) 

Trapezoidal  Decomposition 

<  1.00  ms 

<  1.00  ms 

(<  1%) 

*Edge  interpolation 

11.00  ms 

11.10  ms 

(01.1%) 

0.99 

*Scan-seg  interpolation 

726.40  ms 

1029.80  ms 

(98.9%) 

0.71 

Total  display  time 

738.40  ms 

1041.90  ms 

0.71 

Table  4.6.  Speedup  Table  for  “Dodecalcosahederon” 


4.2.9  Summary  of  “Dodecalcosahederon  Timings’  Similar  to  the  F-16  timing  re¬ 
sults,  the  scalar  version  of  the  program  executed  faster  than  any  of  the  full  vectorized 
versions.  There  were  no  measurable  localized  speedups.  Table  4.6  lists  the  performance 
and  speedup  of  the  scalar  code  over  the  vector  method  four  code.  Note  that  almost  all  of 
the  time  was  consumed  by  scan-segment  interpolation  (almost  99%),  and  the  time  spent 
in  trapezoidal  decomposition  is  negligible.  Since  the  number  of  polygons  composing  the 
“Dodecalcosahederon”  is  small,  trapezoidal  decomposition,  as  well  as  coordinate  transfor¬ 
mation,  perspective  division,  and  lighting  calculations  do  not  consume  much  time. 


4.3  Heuristics  for  Vectorization 

4.3.1  Introduction  An  heuristic  is  a  rule  of  thumb,  or  an  educated  guess  used  for 
problem  solving.  This  section  investigates  rules  which  generally  result  in  the  speedup  of 
vectorized  code  over  scalar  code  and  also  discusses  the  advantages  and  disadvantages  of 
vectorization,  and  follows  this  with  twenty  heuristics  for  vectorization. 

4.3.2  Advantages  and  Disadvantages  of  Vectorization  Effective  vectorization  is  still 
a  black  art,  and  requires  some  of  the  most  difficult  design  and  coding  that  I  have  experi¬ 
enced,  even  more  difficult  than  parallelizing  code  for  implementation  on  several  processors. 
When  vectorizing,  not  only  is  the  computer  scientist  responsible  for  efficient  algorithm  de¬ 
sign,  but  he  must  target  this  design  to  a  specific  architecture.  In  the  absence  of  a  superb 
automatic  vectorizing  compiler,  vectorization  without  a  good  understanding  of  the  target 
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architecture  is  likely  to  go  awry.  Vectorization  is  very  much  a  trial  and  error  process,  and 
sometimes  it  defies  intuition.  It  is  generally  advantageous  to  try  several  different  vector 
alternatives,  which  was  the  intent  of  describing  the  four  different  vector  methods  presented 
in  Chapter  3.  Predicting  which  among  several  different  vector  alternatives  is  likely  to  be 
more  efficient  may  not  work. 

Poor  vectorizing  compilers  also  quickly  frustrate  the  diligent  vectorizer,  but  compilers 
are  like  any  other  software  product,  and  may  be  inefficiently  designed.  While  automatic 
vectorizing  compilers  will  improve,  they  do  not  compensate  for  poor  algorithm  design,  and 
for  best  results,  the  computer  scientist  must  target  an  algorithm  for  either  scalar  or  vector 
operation.  Nonetheless,  despite  the  difficulties  inherent  in  vectorization,  it  can  provide 
tremendous  speedups  when  applied  to  the  appropriate  problems  and  when  done  properly. 

Like  all  methods  for  improving  code  efficiency,  the  advantages  must  be  weighed 
against  the  disadvantages.  There  are  several  areas  ageunst  which  vectorization  must  be 
evaluated  before  a  designer  should  determine  to  vectorize  code.  Most  importantly,  vec¬ 
torized  software  must  not  transform  well  designed  scalar  code  into  an  unintelligible  mess. 
David  Fisher  lists  symptoms  of  poor  software.  He  states:  “symptoms  of  the  software  crisis 
appear  in  the  form  of  software  that  is  non-responsive  to  user  needs,  unreliable,  excessively 
expensive,  untimely,  difficult  to  maintain,  and  not  reusable  (Fisher,  1978:26).”  It  would  be 
foolish  to  evaluate  the  merits  of  vectorization  without  directing  attention  at  these  areas. 
Five  areas:  reliability,  cost,  maintainability,  portability,  and  efficiency  are  discussed  below. 
Note  that  these  five  areas  apply  for  code  which  is  explicitly  vectorized,  and  not  for  code 
which  is  written  in  scalar  form  and  automatically  vectorized  by  the  compiler. 


1.  Reliability.  Vectorized  software  can  certsunly  be  as  reliable  as  scalar  software,  but 
because  it  is  usually  more  difficult  to  design  and  code,  reliable  vectorized  software  is 
likely  to  be  more  difficult  to  achieve  than  reliable  scalar  software. 

2.  Coat.  Generally,  software  developed  to  exploit  a  computer’s  architecture  takes  longer 
to  develop  than  software  which  is  developed  for  the  general  case.  Since  software  cost 
is  directly  proportional  to  the  time  spent  developing  it,  and  vector  software  generally 
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takes  longer  to  develop  than  its  scalar  equivalent,  vector  software  will  probably  more 
expensive  than  scalar  software. 

3.  Maintainability  and  readability.  This  area  is  where  vectorized  software  can  be  a  li¬ 
ability.  Vector  software  is  usually  more  difficult  to  read  than  scalar  software,  and 
what  the  software  is  doing  may  not  be  easily  apparent  from  the  code.  It  requires  data 
structures  which  are  designed  for  vectorization  efficiency  rather  than  data  structures 
which  are  designed  to  be  an  abstraction  of  the  problem.  Less  readable  software  is 
generally  less  maintainable  software,  since  the  muntenance  programmer  must  un¬ 
derstand  the  software  before  it  can  be  modified.  Vector  software  also  contributes  to 
the  proliferation  of  programming  language  dialects,  as  vector  syntax  must  be  sup¬ 
ported  in  the  programming  language  when  explicit  vectorization  is  a  desired  feature. 
With  vectorization  a  known  feature  for  enhancing  code  efficiency,  new  programming 
languages  should  incorporate  vector  syntax  as  part  of  the  language.  For  computers 
which  do  not  have  vector  architectures,  compilers  could  convert  the  vector  syntax 
into  a  sequence  of  equivalent  scalar  instructions.  Ada  compilers  do  this  when  imple¬ 
menting  slice  instructions. 

4.  Portability.  Vectorized  software  is  unlikely  to  be  portable  from  one  computer  archi¬ 
tecture  to  another,  since  effective  vectorization  targets  computer  architecture  fea¬ 
tures. 

5.  Efficiency.  This  is  where  vectorized  software  has  its  primary  advantage  over  scalar 
softw’are.  Sometimes  the  computer  scientist  hzis  no  choice  but  to  radically  improve 
software  to  achieve  particular  timing  goals,  like  real-time  display.  He  can  resort  to 
scalar  assembly  code  programming  to  achieve  this  end,  or  vectorization  may  be  an 
option. 

4.S.3  Twenty  Heuristics  for  Producing  Vectorized  Software  Some  of  the  heuristics 
which  are  presented  here  are  applicable  only  to  the  Pascal  implementation  on  the  Cray-2. 
Many  of  them,  however,  apply  regardless  of  the  machine  for  which  the  software  is  targeted. 

1.  Implement  in  scalar  first  and  vectorize  only  when  necessary.  Implementing  the  scalair 
version  is  usually  less  labor  intensive  than  implementing  the  equivalent  vector  ver- 
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sion.  The  scalar  version  is  also  more  likely  to  be  more  maintainable  and  readable. 
Furthermore,  the  scalar  version  gives  the  designer  an  idea  of  how  the  problem  can 
be  solved,  and  if  the  scalar  version  does  not  meet  performance  requirements  then  it 
can  be  vectorized.  Should  vectorization  become  necessary,  then  the  algorithm  and 
data  structures  used  in  the  scalar  version  may  need  to  change.  However,  if  the  scalar 
version  meets  performance  requirements,  the  disadvantages  of  vectorization  dictate 
that  the  code  not  be  vectorized.  Yourdon  notes  that  “since  only  a  small  portion  of 
code  is  significant  in  terms  of  efficiency,  the  strategy  should  be  to  get  the  system 
working,  then  isolate  the  inefficient  parts  and  optimize  them  (Yourdon, 1979  :  102).” 

2.  Try  porting  to  a  faster  machine.  In  the  event  that  performance  requirements  are  not 
met  and  if  another  machine  is  available  with  a  faster  processor,  faster  memory,  or 
a  better  compiler,  port  the  scalar  code  to  that  machine.  The  faster  machine  may 
provide  the  necessary  speedup.  Basically,  the  goal  is  to  keep  the  code  as  simple  and 
readable  as  possible.  Investigate  the  possibility  of  parallelizing  code,  if  a  parallel 
solution  is  straight-forward  and  easier  to  implement  than  a  vector  solution. 

3.  Make  use  of  the  vectorizing  compiler.  Some  vectorizing  compilers  do  a  superb  job 
of  automatically  vectorizing  “for  loops,”  and  if  they  are  used,  a  minimal  knowledge 
of  the  computer  architecture  is  necessary.  (Preliminary  studies  performed  by  Bob 
Conley  at  the  Air  Force  Weapons  Lab  indicate  that  the  Cray-2  FORTRAN  compiler 
is  better  at  automatically  vectorizing  “for  loops”  than  using  FORTRAN  explicit 
array  syntax,  which  is  not  the  case  for  the  Cray-2  Pascal  compiler.)  Write  the  code 
in  scalar  form,  and  allow  the  compiler  to  vectorize  the  “for  loops.”  This  does  not 
eliminate  the  requirement  to  write  “for  loops”  in  a  form  that  permit  vectorization. 
Remember  that  procedure  calls  and  “if”  statements  may  defeat  vectorization,  and 
thus  some  modularity  may  have  to  be  sacrificed  for  efficiency. 

4.  Watch  out  for  errors  in  compiler  implementations.  Vectorizing  compilers  are  difficult 
to  write  and  may  not  correctly  implement  all  features  of  the  vector  syntax  as  indi¬ 
cated  in  the  reference  manual.  The  Cray-2  Pascal  compiler  is  a  relatively  immature 
compiler,  and  in  the  course  of  this  thesis,  several  errors  were  discovered. 


(a)  One  of  the  errors  discovered  involves  the  use  of  the  array  merge  syntax  in  the 
scan-segment  interpolation  routine; 

inage' [cel [1 . .nscans]] 

if  deplCl . .nscans]  <  zbuff [cel [1. .nscans]]  then 
polcol [1 . .nscans] 
else  image' [cel [1 . .nscans]]  ; 

This  code  causes  a  run-time  error  (memory  interrupt)  when  executed,  “image'” 
is  a  packed  array  containing  pixel  color  values.  This  statement  had  to  be  re¬ 
placed  by: 

for  j  :•  1  to  nscans  do 

if  depl[j]  <  zbuff [cel[j]]  then 
image' [cel [j]]  :«polcol[j]; 

The  compiler  cannot  vectorize  packed  array  structures.  However,  when  vector 
syntax  is  used  with  a  packed  array,  the  compiler  should  always  disable  vector- 
ization  and  replace  the  vector  implementation  with  the  scalar  equivalent. 

(b)  The  compiler  also  seems  to  be  tricked  by  nested  record  structures.  Examine 
the  record  declared  in  Figure  4.3.  Accessing  “vli8t[l.  .numv]  .w.x”  produced 
incorrect  values,  but  the  semantic  equivalent  “vli8t[l.  .numv]  .a [3]”  produced 
the  correct  values. 

5.  Design  data  structures  carefully,  paying  attention  to  stride.  Design  data  structures 
with  a  stride  of  one  between  accessed  elements.  A  vector  memory  access  instruction 
attempts  to  grab,  for  instance,  64  elements  from  memory.  By  keeping  the  stride  low, 
and  avoiding  strides  that  are  powers  of  two,  efficiency  is  enhanced. 

(a)  Compare  the  data  structure  in  Figure  3.4,  Trapezoid  Height  Bucket  Data  Struc¬ 
ture,  which  has  a  stride  of  one  between  successive  variable  references,  (for  ex¬ 
ample,  element  one  of  “xltop”  to  element  two  of  “xltop”)  to  the  data  structure 
in  Figure  4.4,  which  has  a  stride  of  11  between  successive  elements.  The  data 
structure  in  Figure  3.4  is  the  more  desirable. 
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const  vmax 

=  32767; 

{  Max  num  of  ctl  pts 

> 

type 

point3d  = 

record  x, 

y,2  :  real;  end; 

vertex  « 

record 

case  t  : 

boolean  of 

true 

:  ( 

cf 

:  boolean; 

{  Color  flag 

} 

nf 

:  boolean; 

{  Normal  flag 

> 

w 

:  pointSd; 

{  World  space  triple 

> 

e 

:  point3d; 

{  Eye  space  triple 

> 

col 

;  tristim; 

{  Color  tristimulus 

> 

nrm 

:  point3d); 

{  Surface  normal 

> 

false 

:  ( 

a 

:  array [1.. 

14]  of  real) ; 

end; 

{The  a  array  in  the  false  variant  part  overlays} 
{  the  elements  declared  in  the  true  part  } 

varr  «  array  [0..vmax]  of  vertex; 

var 

vlist  :  varr; 
numv  :  integer; 


Figure  4.3.  Nested  Record  Data  Structure 
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type 

color  =  0. .4294967295; 


heightbucket  *  record 
case  t  :  boolean  of 
true 


(xltop 

:  real; 

{top  left  X  value} 

ytop 

:  integer: 

{top  y  value} 

zltop. 

{top  left  z  value} 

xlbot 

:  real ; 

{bottom  left  x  value} 

ybot 

:  integer; 

{bottom  y  value} 

zlbot , 

{bottom  left  z  value} 

xrtop , 

{top  right  X  value} 

zrtop , 

{top  right  z  value} 

xrbot , 

{bottom  right  x  value} 

zrbot 

:  real; 

{bottom  right  z  value} 

polcol 

:  color) ; 

{trapezoid  color} 

false  :  (tb : array [1. .11]  of  integer): 
end  {record} ; 

heightarr  *  airrayCO. .  1023]  of  heightbucket; 
heightcnt  =  array [0 .. 1023]  of  integer; 

var 

hgtbucket  :  heightarr; 
hgtcnt  :  heightcnt; 


Figure  4.4.  Alternate  Trapezoid  Height  Bucket  Data  Structure 
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(b)  In  a  similar  case,  two  different  trapezoid  height  bucket  record  structures  were 
constructed,  one  with  a  stride  of  one  and  one  with  a  stride  of  35.  The  program 
using  the  data  structure  with  a  stride  of  one  executed  7%  faster  than  the  other. 


6.  Be  aware  of  multidimensional  array  access  order.  Different  programming  languages 
access  array  elements  differently.  Contiguous  elements  in  memory  are  referenced 
by  incrementing  the  leftmost  array  subscript  in  a  FORTRAN  array,  while  Pascal 
contiguous  elements  in  memory  are  those  referenced  by  successive  indices  in  the 
rightmost  part  of  the  array.  In  other  words,  when  accessing  FORTRAN  arrays, 
increment  the  leftmost  array  subscript  most  frequently,  and  when  accessing  Pascal 
arrays,  increment  the  rightmost  array  subscript  most  frequently.  For  example,  the 
following  loop  structure  ensures  a  stride  of  one  through  the  Pascal  array: 

type  aarray  *  array [1 . ,256, 1 . .64]  of  real; 

var 

a,  b  :  aarray; 

i,  j  :  integer; 

{equivalent  to  a [1. .256,1. .64,1]  :«  b[l. .256,1. .64] ;} 

for  i  :»  1  to  256  do 
for  j  :=  1  to  64  do 
a[i,j]  :=  b[i,j]; 


However,  the  following  loop  structure  forces  a  stride  of  256,  the  worst  possible: 


for  i  :»  1  to  64  do 
for  j  :»  1  to  256  do 
aCj.i]  :*  bCj.i]  ; 


7.  Minimize  memory  access.  Retrieving  large  vectors  from  memory  and  storing  large 
vectors  can  be  time  consuming.  Try  to  perform  all  necessary  operations  on  a  partic¬ 
ular  vector  all  at  once,  and  store  it  at  the  end.  In  other  words,  minimize  the  use  of 
temporary  variables.  Given  the  following  statements: 
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type  aarray  ®  array  [1..256]  of  real; 


var 

a,  b,  c 

i.  j 


aarray ; 
integer; 


a  :*  b  *  j  ; 
c  ;=  a  ♦  i; 


If  the  optimizer  does  not  catch  the  presence  of  the  unnecessary  store  of  “a,”  then  the 
two  above  assignments  should  be  replaced  with  “c  :*  b  *  j  ♦  i;.” 

8.  Do  not  use  call  by  value  parameters.  Modern  software  engineering  practices  specify 
using  call  by  value  formal  parameters  when  the  actual  parameter  value  is  not  to  be 
changed,  eliminating  the  possibility  of  unintentionally  changing  the  actual  parameter. 
Call  by  value  parameters  require  a  copy  of  the  actual  parameter  to  be  made  at 
the  procedure  call.  For  parameters  which  consume  small  amounts  of  memory,  the 
overhead  of  making  copies  is  negligible.  However,  for  large  vectors,  this  overhead  is 
significant.  For  large  vectors,  use  call  by  reference  parameters.  A  copy  of  the  vector 
is  not  made,  but  instead,  a  pointer  to  the  vector  is  passed  to  the  called  procedure. 
In  Pascal,  call  by  value  is  the  default  parameter  passing  mode,  and  call  by  reference 
is  specified  by  using  “var”  in  the  formal  parameter  list.  Originally,  the  z-buffer 
program  passed  the  world  coordinate  vector,  a  vector  with  8000  elements,  to  the 
world-to-eye  coordinate  conversion  procedure  by  value.  It  took  35.5  ms  to  perform 
the  conversion.  When  this  vector  was  passed  using  call  by  reference,  the  procedure 
took  around  1  ms. 

9.  Work  with  large  arrays.  The  Cray-2’8  vector  registers  can  hold  64  64-bit  vectors. 
When  possible,  save  up  elements  until  there  are  64.  This  was  the  purpose  of  the 
height  (edge)  buckets  and  the  length  (scan-segment)  buckets  used  in  the  z-buffer 
program.  Edges  or  scan-segments  of  the  same  length  were  accumulated  until  there 
were  64.  Then  the  buckets  were  emptied  and  the  vector  operations  were  performed. 
Generally,  the  larger  the  array,  the  better. 
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Note,  however,  that  using  buckets  is  not  always  advantageous.  They  improve  perfor¬ 
mance  only  when  the  extra  processing  required  to  fill  them  is  offset  by  the  efficiency 
gains  of  having  the  Cray-2  operate  on  large  vectors.  In  many  cases,  this  extra  process¬ 
ing  to  fill  the  buckets  must  proceed  in  scalar  mode,  and  consequently,  performance 
suffers.  While  the  F-16  display  time  was  faster  for  bucketized  interpolation  than 
for  non-bucketized  interpolation,  the  “Dodecalcosahederon”  display  time  was  faster 
when  non-bucketized  interpolation  was  used. 

10.  Avoid  the  requirement  for  type  conversions.  When  the  right  side  of  an  assignment 
statement  contains  a  variable  of  a  different  type  than  the  left  side  the  statement, 
a  type  conversion  must  occur.  Type  conversions  slow  vector  operations.  Generally, 
this  is  not  a  problem,  since  Pascal  does  not  permit  incompatible  type  assignments. 
However,  Pascal  permits  the  assignment  of  an  integer  variable  to  a  real  variable.  For 
example,  suppose  the  following  vector  operation  is  performed: 

pixdepCO. .length]  ;«  depl  ♦  pixpos[0. .length]  *  ddep; 

and  “pixdep,”  “depl,”  and  “ddep”  are  real  variables,  while  “pixpos”  is  an  integer 
array.  “Pixpos”  must  undergo  a  type  conversion  from  integer  to  real.  By  varying 
“length”  from  0  to  100  with  “pixpos”  declared  an  integer  array,  the  statement 
executed  22%  to  53%  worse  than  if  “pixpos”  were  declared  a  real  array;  and  as 
“length”  grew,  the  statement  with  the  integer  “pixpos”  performance  worsened. 

11.  Time  individual  sections  of  code.  The  rotation,  scaling,  translation  and  world-to-eye 
coordinate  conversion  sections  of  code  consume  such  a  small  portion  of  the  total 
processing  time  that  their  effect  on  the  efficiency  of  the  whole  program  is  negligible, 
reinforcing  the  importance  of  timing  individual  sections  of  code  to  find  where  sig¬ 
nificant  portions  of  time  are  spent  and  concentrating  on  vectorizing  them.  Indeed, 
the  old  axiom  that  10%  of  the  code  takes  90%  of  the  time  seems  to  apply  (Yourdon, 


12.  Vectorization  usually  requires  extra  processing.  Most  vectorization  requires  some 
overhead  processing  which  is  not  required  for  the  equivalent  scalar  implementation. 
The  time  taken  for  overhead  processing  must  be  offset  by  the  speedup  achieved  by 
vectorization.  For  the  vector  z-buffer,  the  decomposition  of  polygons  into  trapezoids, 
not  required  for  the  scalar  z-buffer,  was  pure  overhead  processing  which  enabled  vec¬ 
torized  edge  interpolation.  Unfortunately,  the  speedup  achieved  did  not  compensate 
for  the  extra  processing  required. 

13.  Understand  the  order  of  “if  statement”  evaluations.  Array  merges  cause  problems 
when  there  are  exceptional  conditions. 

(a)  For  example,  the  following  statement  causes  an  execution  error  -  “divide  by 
zero”  when  any  element  of  “dpix”  is  zero: 

ddep [1 . .ntraps]  :* 
if  dpix [1 . .ntraps]  >  0  then 
deprCl. . ntraps] -deplCl. .ntraps] /dpix [1. .ntraps] 
else  0.0; 

The  “if  depr[l. .ntraps] -deplCl. .ntraps] /dpix [1. .ntraps]”  and  “else 
0.0”  clauses  are  evaluated  first;  then  the  “if  dpixCl.  .ntraps]  >  0”  is  eval¬ 
uated  to  determine  which  results  to  place  in  the  left  side  of  the  expression. 

(b)  The  following  is  a  method  to  treat  exceptionad  conditions: 

if  All (k [1 . .length]  <>  0.0)  then 
rCl., length]  :=  1.0  /  k[l.. length] 
else  Halt; 

If  any  of  the  elements  of  “k”  is  0.0,  then  the  program  halts,  preventing  division 
by  zero.  However,  if  special  processing  is  required  for  any  elements  which  are 
0.0,  instead  of  halting  the  program,  then  each  element  in  the  “k”  array  must  be 
individually  inspected  in  scalar  form. 

14.  Avoid  data  dependencies.  Expressions  with  data  dependencies  cannot  be  vectorized 
reliably.  For  instance,  the  expression; 


hgt cut [ index [1. .length]]  :=  hgtcnt [index [1. .length]]  +  1; 


does  not  give  reliable  results  when  two  or  more  elements  of  “index [1 .  .length]  ”  are 
identical.  If  “index"  contains  the  values  “1,  3,  4,  and  1",  then  we  would  expect 
“hgtcntfl]”  to  be  incremented  by  “2”  and  “hgtcnt[3]”  and  “hgtcnt[4]”  to  be 
incremented  by  “1”.  However,  this  does  not  occur  when  the  value  to  be  stored  in 
“hgtcntfl]”  does  not  have  time  to  complete  the  vector  operation  and  be  stored 
before  the  old  value  of  “hgtcntfl]”  enters  the  pipeline. 

15.  Enable  range  checking  to  minimize  debugging.  The  Pascal  compiler  has  an  array  index 
range  checking  capability  (the  r+  option)  which  prevents  array  indices  from  exceeding 
their  bounds.  Range  checking  must  be  disabled  for  vectorization  to  occur.  When 
developing  code,  enable  range  checking,  otherwise  array  bounds  may  be  exceeded 
without  a  run-time  error.  Once  the  program  generates  the  correct  results,  disable 
range  checking  so  the  code  can  be  vectorized. 

(a)  For  example,  when  range  checking  is  disabled,  the  following  potential  run-time 
error,  triggered  when  “ntraps”  is  greater  than  64,  may  not  be  caught: 

type  mat64rea  *  arrayfl..64]  of  real; 
var  dxl,  xltop,  xlbot  :  mat64rea; 

dxlfl . .ntraps]  :=  xltopfl . .ntraps]  -  xlbot fl . .ntraps] ; 

(b)  For  an  indication  of  how  much  quicker  a  section  of  code  can  run  with  range 
checking  disabled  (and  vectorization  enabled)  as  opposed  to  range  checking 
enabled,  consider  the  vectorization  of  “Make  trapezoid  buckets”  (MTB)  and 
“Empty  trapezoid  buckets”  (ETB).  With  range  checking  disabled,  MTB  took 
72.08  ms  to  execute,  while  ETB  took  60.89  ms.  With  range  checking  enabled, 
MTB  took  232.00  ms  to  execute,  while  ETB  took  293.28  ms.  With  range  check¬ 
ing  enabled,  MTB  and  ETB  took  almost  four  times  longer  to  execute. 
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16.  Use  unpacked  arrays.  Minimize  the  use  of  operations  on  packed  arrays,  which  the 
Pascal  compiler  cannot  vectorize.  Sometimes  packed  arrays  must  be  used,  however, 
and  a  solution  to  this  problem  is  to  operate  on  unpacked  arrays  and  pack  them  later. 
(The  Ultra  frame  buffer  requires  packed  image  arrays). 

17.  Take  advantage  of  large  memory  availability.  Most  supercomputers  have  an  enor¬ 
mous  memory  capacity.  Trade  memory  for  execution  time.  Avoid  dynamic  memory 
allocation  requiring  pointers,  since  the  stride  is  unpredictable,  and  use  large  array 
structures.  Use  buckets  when  appropriate  to  accumulate  vectors  on  which  to  operate. 

18.  Use  matrix  solutions  when  possible.  Computer  graphics  requires  matrix  algebra  for 
many  data  manipulations  (such  as  rotations,  scalings,  translations,  and  other  trans¬ 
formations).  Attempt  to  express  solutions  in  matrix  form,  if  one  exists,  before  vec¬ 
torizing.  Solutions  which  can  be  expressed  in  matrix  algebra  form  are  some  of  the 
easier  to  implement  in  vector. 

19.  Realize  that  results  are  usually  data  sensitive.  Just  because  a  vectorized  program 
produces  a  speedup  for  one  set  of  data,  it  may  not  produce  a  speedup  for  another 
set,  especially  if  one  set  of  data  is  characterized  by  long  vectors,  and  another  by  short 
vectors.  Collecting  statistics  on  data  characteristics  may  shed  light  on  why  speedup 
is  not  occurring. 

20.  Do  not  get  frustrated.  Vectorization  is  difficult,  and  requires  time  to  learn.  It  is  a 
new  programming  skill  which  is  perhaps  the  most  difficult  of  all  programming  skills 
to  master. 

4.4  High  Level  Design  for  a  New  Vectorized  Z-buffer  Algorithm 

From  experience  gained  in  vectorizing  the  z-buffer,  a  new  algorithm  is  presented 
which  should  provide  a  speedup  over  the  current  vectorized  z-buffer. 

1.  Read  the  object  and  polygon  descriptions. 

2.  Transform  polygon  local  coordinates  to  world  coordinates. 
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3.  Transform  polygons  to  triangles.  Place  triangles  to  be  flat  shaded,  Gouraud  shaded, 
and  Phong  shaded  into  3  separate  data  structures. 

4.  For  all  frames  to  be  generated  do  ^5  -  #14 

5.  *  Transform  the  triangles  to  the  proper  view  by  rotating,  scaling,  or  translating. 

6.  *  Transform  all  triangle  world  coordinates  to  eye  coordinates. 

7.  Clip  all  triangles  to  the  screen. 

8.  *  Decompose  all  triangles  into  trapezoids.  Each  triangle  is  decomposed  into  two 
trapezoids. 

9.  *  Perform  flat  or  Gouraud  or  Phong  shading  on  all  trapezoids. 

10.  *  Perform  perspective  division  on  all  trapezoids. 

11.  For  each  trapezoid  do: 

(a)  *  Place  trapezoids  into  height  buckets. 

(b)  If  the  current  bucket  contains  64  trapezoids,  empty  the  bucket: 

i.  *  Determine  the  right  and  left  edge  values  for  position,  (color,  and  normal 
if  Gouraud  or  Phong  shading  is  used)  for  all  the  trapezoids  in  the  bucket. 
Now,  each  trapezoid  in  the  bucket  has  been  decomposed  into  scan-segments. 

ii.  Place  all  scan-segments  into  a  bucket  for  later  interpolation. 

12.  Empty  all  height  buckets  which  are  not  empty. 

13.  *  Interpolate  all  scan-segments  in  vector. 

14.  Display  the  contents  of  the  frame-buffer  on  the  Ultra  display. 

Notice  that  steps  seven  through  ten  are  performed  on  all  the  triangles  at  once,  not  on 
polygons  taken  one  at  a  time  as  the  design  in  Chapter  3  shows.  This  allows  the  designer 
to  work  with  longer  vectors  than  before. 


Interpolation 

Pascal 

FORTRAN 

Speedup 

(no  buckets) 

F-16 

Dodeca 

F-16 

Dodeca 

F-16 

Dodeca 

Edge 

157.10 

11.10 

6.24 

0.11 

25.16 

100.91 

Scan-segment 

507.80 

1029.80 

215.82 

82.06 

2.35 

12.55 

Total 

664.90 

1040.90 

222.06 

82.17 

2.99 

12.67 

Table  4.7.  Pascal  and  FORTRAN  Interpolation  Timings  (ms) 

4.0  Conclusion 

Obviously,  the  Pascal  code  did  not  execute  quickly  enough  for  near  real-time  display. 
However,  it  does  not  follow  that  real-time  performance  is  not  possible  with  the  existing 
Pascal  compiler,  but  clearly,  the  code  must  be  restructured  if  real-time  performance  is  to 
be  achieved;  and  yet  restructuring  may  not  produce  the  needed  speedup.  One  option  is  to 
examine  the  Pseudo  Cray  Assembler  Language  (CAL)  produced  by  the  Pascal  compiler  for 
efficient  instruction  sequences.  Unfortunately,  the  pseudo  CAL  output  is  not  consistent 
with  the  actual  machine  instruction  sequence  produced  by  the  compiler  and  therefore 
analysis  of  the  pseudo  CAL  output  cannot  indicate  reliably  where  the  Pascal  compiler 
might  be  generating  inefficient  object  code  (Conley,  Sep  88). 

The  Pascal  z-buffer  indicates  that  the  most  time  consuming  portion  of  the  code  is  edge 
and  scan-segment  interpolation,  together  consuming  64%  of  total  display  time  for  the  vector 
version  (615  ms)  and  75%  of  the  total  display  time  for  the  scalar  version  (589  ms).  Despite 
numerous  attempts  at  restructuring  these  two  routines,  speedup  has  been  elusive,  so  these 
routines  were  recoded  in  FORTRAN,  given  Cray  FORTRAN  is  a  more  efficient  compiler 
than  the  Pascal  compiler.  Estimates  of  the  relative  performance  of  FORTRAN  to  Pascal 
can  be  obtained  by  recoding  edge  and  scan-segment  interpolation  routines  in  FORTRAN 
and  comparing  the  results  to  those  produced  by  the  Pascal  compiler.  While  it  is  difficult 
to  determine  whether  the  speedup  is  due  to  vectorization  or  due  to  using  the  more  efficient 
FORTRAN  compiler  instead  of  the  Pascal  compiler,  the  results  are  astounding.  Translating 
the  Pascal  edge  and  scan-segment  interpolation  routines  into  FORTRAN  resulted  in  the 
timings,  performed  for  both  the  F-16  and  the  “Dodecalcosahederon,”  shown  in  Table  4.7. 

The  FORTRAN  version  performed  the  edge  and  scan-segment  interpolation  routines 
in  222.06  ms  for  the  F-16.  Experimenting  with  the  FORTRAN  edge  and  scan-segment 
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interpolation  routines  by  modifying  data  structures  and  restructuring  the  control  flow  may 
produce  further  speedup.  With  such  superb  performance  obtained  merely  by  translating 
Pascal  into  FORTRAN,  the  other  sections  of  code  should  be  written  in  FORTRAN,  instead 
of  Pascal,  according  to  the  design  language  presented  in  this  chapter.  Nonetheless,  both 
of  these  routines  must  execute  in  less  than  100  ms  for  near  real-time  performance.  While 
100  ms  may  be  difficult  to  achieve  solely  by  vectorization  using  FORTRAN,  rewriting  the 
critical  z-buflfer  routines  in  Cray  assembler  may  produce  the  desired  speedup. 
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V.  Conclusions 


5.]  Introduction 

This  chapter  covers  the  recommendations  for  follow-on  thesis  research,  the  economics 
of  real-time  image  display,  and  concludes  by  addressing  the  thesis  statement  formulated  in 
Chapter  1. 

5.2  Recommendation  for  Future  Thesis  Research 
This  thesis  can  be  enhanced  in  at  least  two  ways: 

1.  Restructure  the  scalar  algorithm  to  permit  further  vectorization,  according  to  the 
z-buffer  high-level  design  presented  in  Chapter  4.  Use  the  FORTRAN  programming 
language. 

2.  Parallelize  the  z-buffer  software  by  using  all  four  of  the  Cray-2’s  processors. 

A  combination  of  vectorization  and  parallelization  is  likely  to  produce  speedups 
greater  than  either  can  achieve  alone.  Restructuring  the  scalar  algorithm  was  already 
sufficiently  discussed  in  Chapter  4.  A  parallel  z-buffer  algorithm  is  mentioned  in  at  least 
four  articles:  (Fiume  and  Fournier,  1983),  (Parke,  1980),  (Kaplan  and  Greenberg,  1979), 
and  (Weinberg,  1981).  Two  methods  for  implementing  the  code  in  parallel  are: 

1.  Processor  Pool. 

2.  Screen-space  Division. 

Each  is  discussed  briefly  in  the  following  sections,  with  the  intent  of  providing  ideas  for 
future  research. 

5.2.0. 1  Processor  Pool  One  of  the  four  processors  would  act  as  an  executive, 
parcelling  polygons  to  one  of  three  processors  in  the  available  processor  pool.  Polygon 
decomposition  into  trapezoids,  and  edge  and  scan-segment  interpolation  would  proceed 
normally.  When  these  calculations  cease,  the  processor  is  placed  back  into  the  available 
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processor  pool,  awaiting  assignment  by  the  executive.  The  maximum  speedup  for  this 
method  is  three.  A  major  disadvantage  to  this  method  results  from  shared  memory  con¬ 
flicts.  It  is  possible  that  two  or  more  processors  may  try  to  update  simultaneously  either 
the  depth-buffer  or  the  refresh-buffer,  and  preventing  all  but  one  of  the  processors  from 
simultaneous  access  slows  down  execution.  One  advantage  is  approximately  even  processor 
load. 

5. 2.0. 2  Screen  Space  Division  In  this  method,  the  screen  is  divided  into 
fourths  and  each  of  the  four  processors  is  assigned  responsibility  for  displaying  the  proper 
part  of  the  image  on  its  portion  of  the  screen.  Each  processor  is  given  a  copy  of  the  entire 
object  description,  and  it  clips  the  object  polygons  to  its  screen  portion.  This  method 
is  not  susceptible  to  several  processors  attempting  to  access  identical  memory  cells  in  the 
z-buffer  or  refresh  buffer,  since  each  processor  is  responsible  for  its  own  z-buffer  and  refresh 
buffer.  However,  it  is  vulnerable  to  uneven  processor  load.  If  the  object  occupies  only  a 
certain  part  of  the  screen,  one  or  more  of  the  processors  may  be  idle  for  significant  amounts 
of  time  (or  altogether  idle).  This  can  be  minimized  by  carefully  subdividing  the  screen 
(interleaving  processor  responsibilities  over  the  screen),  but  the  more  complex  the  screen 
shape  for  which  each  processor  is  responsible,  the  more  time  that  is  spent  clipping  the 
object  polygons  to  the  screen  shape.  This  method  also  complicates  antialiasing,  due  to  the 
communication  of  pixel  color  values  from  one  processor  to  another  for  those  pixels  which 
border  on  two  or  more  processors’  screen  area.  Another  consideration  is  that  this  method 
requires  four  copies  of  the  object  description,  which  can  occupy  a  significant  amount  of 
memory.  Maximum  speedup  due  to  this  type  of  parallelization  is  four. 

5. 2. 0.3  Options  for  Real-time  Display  Regardless  of  the  method  chosen,  par¬ 
allelization  may  negatively  affect  vectorization.  One  disadvantage  of  the  screen  space 
division  method  is  that  vector  length  decreases  when  scan-segments  are  broken  into  pieces 
for  interpolation  by  different  processors.  Additionally,  if  bucketized  edge  or  scan-segment 
interpolation  is  used,  the  number  of  height  and  length  buckets  quadruple  for  the  four 
processors  and  each  bucket  may  contain  fewer  scan-segments  to  interpolate  in  vector.  Fur¬ 
thermore,  the  maximum  speedup  is  four,  so  even  if  parallelization  is  exploited,  real-time 


performance  may  not  be  achieved.  For  a  ten  frame  per  second  image  display,  the  vectorized 
program  must  be  capable  of  displaying  the  image  in  no  more  than  400  ms,  and  a  speedup 
of  four  must  be  obtained  by  parallelization.  An  added  disadvantage  of  parallelization  is 
expense.  While  vectorization  attempts  to  use  each  processor  more  efficiently,  paralleliza¬ 
tion  requires  the  use  of  more  processor  resources,  which  may  be  perceived  unfavorable  by 
other  Cray-2  users  competing  for  these  resources. 

5.3  The  Economics  of  Real-time  Image  Display 

Examining  the  economics  of  generating  real-time  images  using  the  Cray-2  is  not  a 
primary  objective  of  this  thesis,  but  one  cannot  help  but  wonder  whether  the  attempt  is 
worthwhile.  Certainly  this  thesis  has  provided  some  experience  and  insight  into  vectorizing 
algorithms,  which  is  important  and  useful  information  if  further  research  of  real-time  image 
generation  using  the  Cray-2  is  continued.  What  is  not  so  obvious  is  whether  the  Cray-2 
should  be  used  for  real-time  image  generation.  Clearly,  using  the  Cray-2  for  this  purpose 
is  expensive,  especially  if  all  four  processors  are  used.  While  the  Cray-2  with  the  Ultra 
frame  buffer  is  an  enviable  tool  for  graphics  algorithm  development  and  research,  for  those 
scientists  not  in  the  business  of  creating  graphics  algorithms,  but  just  using  them,  the 
Cray-2  with  the  Ultra  frame  buffer  is  an  expensive  tool  for  graphics,  especially  w’ith  the 
advent  of  the  computer  graphics  workstation.  This  does  not  lessen  the  importance  of 
developing  efficient  algorithms  targeted  for  the  supercomputer,  for  there  may  be  no  option 
but  to  use  the  Cray-2,  and  speedy,  efficient  graphics  display  is  highly  desirable.  There  are 
cases  when  the  supercomputer  is  absolutely  essential  for  collecting  and  reducing  scientific 
data.  Downloading  this  data  to  a  computer  graphics  workstation  may  be  prohibitively 
expensive  and  there  may  be  no  other  option  than  to  use  the  Cray-2  for  generating  images. 

However,  excluding  the  economics  of  using  the  supercomputer  from  a  decision  to 
display  data  is  unwise.  Dedicated  use  of  a  multimillion  dollar  supercomputer  is  wasteful 
if  a  $100,000  workstation  can  do  the  job  in  the  required  time.  Reserving  the  Cray-2 
for  graphics  work  when  it  is  not  feasible  to  download  data  from  the  supercomputer  to  a 
graphics  workstation,  or  where  flexibility  in  changing  display  algorithm  is  desired,  should 
be  examined.  Otherwise,  consideration  should  be  given  to  using  another,  ess  expensive 
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computer. 

It  is  also  interesting  that  as  this  thesis  was  well  underway,  Titan,  Ardent,  and  Silicon 
Graphics  introduced  machines  that  display  images  in  real-time  using  the  z-buffer  algo¬ 
rithm;  at  a  relatively  low  cost  -  around  $100,000.  All  three  companies  used  vector  pipeline 
technology  for  initial  data  transformations,  perspective  division,  and  lighting,  but  in  all 
three  cases,  specific  VLSI  hardware  was  developed  for  both  edge  and  scan-segment  inter¬ 
polation.  Could  these  companies  have  discovered  already  that  general  purpose  hardware 
(vector  pipelines)  are  advantageous  only  for  coordinate  transformation,  perspective  divi¬ 
sion,  and  lighting  calculations,  while  specific  VLSI  graphics  engines  were  required  for  edge 
and  scan-segment  interpolation?  One  can  only  speculate.  However,  using  these  machines 
is  not  without  its  disadvantages.  Downloading  display  data  from  the  supercomputer  over 
low  bandwidths  to  the  workstations,  data  format  restrictions,  and  lack  of  flexibility  to 
change  the  display  software  are  significant  disadvantages  which  the  Cray-2  does  not  have. 

If  real-time  display  is  to  be  achieved,  a  combination  of  speedup  techniques  must  be 
used.  The  following  are  options  which  may  provide  this  speedup  if  correctly  implemented: 

1.  Effective  vectorization  of  the  z-buffer  code. 

2.  Coding  the  algorithm  in  FORTRAN  or  perhaps  assembler. 

3.  Parallelization  of  the  z-buffer  software. 

4.  Replacing  the  z-buffer  algorithm  with  another,  more  appropriate  hidden-surface  al¬ 
gorithm  that  is  better  suited  to  the  Cray-2  architecture. 

5.^  Conclusion 

The  purpose  of  this  thesis  was  to  gain  experience  and  insight  into  vectorizing  soft¬ 
ware.  The  heuristics  presented  in  Chapter  4  can  be  applied  to  the  vectorization  of  other 
computer  software.  One  important  observation  is  that  explicit  vectorization  can  obscure 
the  clear  function  of  software  and  decreases  software  readability.  Thus,  transforming  scalar 
software  into  vector  software  using  explicit  vector  syntax  should  be  done  sparingly,  only 
when  absolutely  essential  to  meet  performance  requirements.  However,  efficient  \<(  lorizing 
compilers  should  be  used  to  automatically  vectorize  software. 


This  thesis  has  shown  that  vectorization  is  a  powerful  means  of  increasing  software 
efficiency.  While  real-time  performance  was  not  achieved,  it  does  not  follow  that  it  is  not 
possible,  but  it  has  been  demonstrated  that  it  is  difficult  to  achieve.  Consideration  must 
be  given  to:  1)  compiler  efficiency,  2)  supercomputer  architecture,  3)  type  of  data  to  be 
displayed,  and  4)  display  algorithm.  Clearly,  when  attempting  to  design  efficient  software 
if  any  one  of  these  four  areas  is  not  optimized,  the  entire  display  performance  will  suffer. 
Originally,  the  code  was  developed  with  the  assumption  that  the  Pascal  compiler  could 
be  relied  upon  to  generate  efficient  code,  but  this  was  a  faulty  assumption.  What  has 
become  increasingly  obvious  is  that  high  level  languages  provide  facilities  to  investigate 
differences  in  algorithm  design,  to  weed  out  poor  designs,  and  to  provide  rough  estimates 
of  performance,  but  caution  must  be  exercised  when  the  high  level  languages  are  used  to 
make  conclusions  at  low  levels  of  detail,  such  as  the  efficiency  of  edge  or  scan-segment 
interpolation.  At  these  levels,  the  software  designer  is  susceptible  to  inefficient  compiler 
design.  Indeed,  real-time  performance  is  a  difficult  goal  which  will  require  further  exami¬ 
nation  and  time  consuming  analysis  of  the  compiler  used  and  the  object  code  generated.  It 
is  not  sufficient  to  merely  rely  upon  good  algorithm  design  to  provide  the  speedups.  When 
pushing  the  limits  of  computer  hardware  and  software  technology,  it  is  essential  that  all 
four  areas  be  given  ample  consideration. 


Appendix  A.  Pascal  Vector  Syntax  Overview 

Pascal  vector  syntax  makes  extensive  use  of  the  slice  expression.  A  slice  selects 
certain  array  indices  for  processing.  For  example: 

for  j  :=  1  to  64  do 
aCi]  :=  0; 

is  equivalent  to: 

a[l. .64]  0; 

The  indices  “1.  .64”  could  have  been  variables  whose  values  were  determined  at  exe¬ 
cution  time,  or  they  could  have  been  constants.  Vector  syntax  can  select  multidimensional 
arrays.  For  example: 

for  i  :=  1  to  64  do 
for  j  :=  1  to  256  do 

aCi.j]  :*  0; 

is  equivalent  to: 

a[l. .64,1. .256]  :«  0; 

Stride  can  be  specified  in  discrete  increments  for  multidimensional  arrays.  If  no 
stride  is  specified,  the  default  stride  is  one.  A  one  dimensional  array  with  a  stride  of  eight 
is  written: 


a[1..64.8]  :«  0; 


•  Elements  aCl]  ,  aC9]  ,  aCl7]  ,  aC25]  ,  a[33]  ,  aC41]  ,  a[49]  ,  and  a[57]  leceive 

the  value  0.  The  others  remain  uninitialized.  One  final  concept  needs  to  be  addressed.  If 
an  array  were  declared: 

a  :  array [1.. 64]  of  integer; 


then  the  following  would  be  equivalent  statements; 

a[l . .64]  :=  0;  or 
a  :=  0; 


A  more  detailed  discussion  of  Pascal  vector  syntax  is  presented  in  the  Cray  Computer 
Systems  Pascal  Reference  Manual,  Chapter  6,  Array  Processing. 
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Korrnf  (lovf|o|)iiien»s  in  scientific  computing  have  prompted  the  nee<l  for  snpercom- 
|)U(ei  graphics  research.  These  developments  include  the  rcHjuirement  to  visualize  large 
amounts  of  data  wliich  are  processed  l*y  the  snpercomimter.  prefevaldy  l)y  real-time  image 
generation.  I'nfortnnately.  most  currently  ti.sed  grajihics  algorithms  are  not  optimized  for 
vector  computers.  This  thesis  involves  the  de.sign  and  implementation  of  a  hidden-surface 
removal  algorithm  for  the  Cray-2  vector  supercomputer,  with  the  goal  of  real-time  image 
display. 

A  z-lmlfer  liidden-surfare  algorithm,  written  in  Pascal,  was  veclorizt'd  and  imple¬ 
mented  on  the  Cray-2.  Speiial  attention  wasdirecterl  toward  the  methodology  of  algorithm 
and  data  structure  design  to  e.xploit  the  Cray-2  architecture.  Timing  studies  comparing 
the  vector  version  to  the  eipii valent  scalar  version  showed  that  while  t  he  Pascal  vector  rode 
produced  localized  speedups,  the  vector  code  was  less  ellicieiit  than  the  scalar  code.^\Vhen^ 
critical  portions  of  the  code  were  translated  from  Pascal  to  FORTRAN,  significant  spec'diip 
was  achieved,  indicating  that  the  FOR'J'RAN  compiler  generates  more  efficient  object  code 
than  the  Pascal  compiler.  However,  real-tiiiie  performance  was  not  achieved.  Based  on 
the  knowledge  gained  from  vectorizing  the  z-buffer  algorithm,  vectorization  heuristics  were 
discovered  and  a  new  algorithm  which  should  provide  further  speedup  is  presented. 

Experience  from  the  thesis  research  indicates  that  when  attempting  to  achieve  real¬ 
time  |)erformanco.  consideration  must  be  given  to  compiler  efficiency,  supercomputer  archi¬ 
tecture.  tyjie  of  ilata  displayed,  and  display  algorithm.  Relying  only  upon  good  algorithm 
design  may  not  produce  adecpiate  performance,  and  if  any  of  these  four  areas  is  not  prop¬ 
erly  addressed,  performance  suffers.  Another  important  observation  is  that  vectorization 
is  difficult  and  can  obscure  the  clear  function  of  software,  decreasing  software  readability 
and  perhaps  maintainability.  Thus,  transforming  scalar  software  into  software  structured 
for  vi'clor  operations  sluiuld  be  done  sparingly:  only  when  absoluti'ly  essential  to  me«'t  per¬ 
formance  requirements.  However,  ilespile  its  difficulties,  vectorization  is  a  powerful  way  to 
increase  software  efficienev. 
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